ADAPTIVE CONCENTRATION OF REGRESSION TREES,
WITH APPLICATION TO RANDOM FORESTS 


By Stefan Wager and Guenther Walther 
Stanford University 

We study the convergence of the predictive surface of regression 
trees and forests. To support our analysis we introduce a notion of 
adaptive concentration for regression trees. This approach breaks tree 
training into a model selection phase in which we pick the tree splits, 
followed by a model fitting phase where we find the best regression 
model consistent with these splits. We then show that the fitted
regression tree concentrates around the optimal predictor with the same
splits: as d and n get large, the discrepancy is with high probability
bounded on the order of $\sqrt{\log(d)\log(n)/k}$ uniformly over the whole
regression surface, where d is the dimension of the feature space, n is
the number of training examples, and k is the minimum leaf size for
each tree. We also provide rate-matching lower bounds for this adaptive
concentration statement. From a practical perspective, our result
enables us to prove consistency results for adaptively grown forests 
in high dimensions, and to carry out valid post-selection inference in 
the sense of Berk et al. [2013] for subgroups defined by tree leaves. 


1. Introduction. Trees [10] and random forests [8] are among the most 
widely used machine learning predictors today, with applications in a broad 
variety of fields such as chemistry [42], ecology [13, 35], genetics [16, 40], and 
remote sensing [20, 34]. While allowing for flexible predictive surfaces and 
complicated interactions, trees and especially random forests have proven 
to be surprisingly resilient to over-fitting. Unlike competing non-parametric 
techniques such as kernel methods or neural networks, random forests require 
very little tuning; experience has shown that one can often obtain good 
predictive models out-of-the-box with standard software like randomForest 
for R [27]. However, from the perspective of existing results, we have no 
particularly strong reasons to believe that forest predictions ought to be 
well behaved in high dimensions: The best existing convergence results for 
random forests either only provide fixed-dimensional asymptotic consistency 
guarantees [10, 38], or assume a substantially simplified training procedure 
where tree splits are chosen using a holdout dataset [5, 14].

Theoretical framework. The goal of this paper is to use adaptive
concentration as a framework for describing the statistical properties of
adaptively grown trees, i.e., trees that have access to the full training
data while placing tree splits. The idea of adaptive concentration is to
view training trees as occurring in two stages: a model selection stage
where we decide on which splits to make, and a model fitting stage where we
find the best regression tree conditional on having made these splits. We
then treat the splits made by the tree as fixed, and show that the fitted
regression tree is not much worse than the optimal regression tree with the
same splits. In other words, we establish conditions under which sample
averages over tree leaves L concentrate to population averages over L, even
if the leaves L were chosen after looking at the data. Figure 1 illustrates
this goal for a one-dimensional tree.

Fig 1: Adaptive concentration compares the prediction surface of the fitted
decision tree with that of the optimal decision tree with the same splits.
Here, the splits produced by recursive partitioning are denoted by dashed
vertical lines. The regression tree was fit using the R-package tree [44].

Our setting is closely related to the work of Berk et al. [4] on valid
post-selection inference, which provides convergence guarantees for
estimated linear regression parameters that hold even if the regression
model is selected after looking at the data. In order to make the
connection explicit, suppose that we view trees as a form of adaptive
regression, where each leaf L induces an indicator feature for whether a
given observation falls into the leaf. Our result then directly induces a
method for constructing confidence intervals for the within-leaf mean
responses that accounts for model selection in the sense of Berk et al. [4]
(see Section 2.3 for details).

Main results. We study an asymptotic regime where the dimension d of
the feature space, the number n of training examples and the minimum leaf
size k go to infinity together. Under mild regularity conditions, we show
that regression trees satisfy an adaptive concentration bound that scales as
$\sqrt{\log(n)\log(d)/k}$. More specifically, with high probability and
simultaneously for any leaf L induced by any regression tree, the
sample-average response inside the leaf L and the population-average
response inside L differ by at most $C\sqrt{\log(n)\log(d)/k}$, where C is a
universal constant. In the context of Figure 1, this result translates into
a uniform bound on the distance between the “fitted” and “optimal” curves.
We also show that this rate of convergence is tight to within a constant
factor.

Our result does not rely on major modifications to the random forest
training routine, and in particular holds for variants of the CART algorithm
[10] and the original proposal of Breiman [8]. The reason there is a
dependence on n in the numerator of our bound is that, as the sample size
grows, the trees comprising the random forest can become deeper and so the
model family becomes larger.

Finally, as an application of our adaptive concentration result, we
establish consistency guarantees for adaptively grown random forests (i.e.,
forests that do not use a holdout set for variable selection) that hold in a
sparse, high-dimensional setting. To our knowledge, no directly comparable
results are currently available in the literature (see Section 3 for
details).

Outline. This paper is structured as follows. We first state and discuss our 
main adaptive concentration bound in Section 2, and then apply this result 
to provide consistency results for high-dimensional random forests in Section 
3. Our theoretical work is carried out in Section 4, where we develop the 
necessary tools to prove the results from the first sections. Section 5 derives 
matching lower bounds for the adaptive concentration rate of regression 
trees. 

2. Adaptive Concentration. To give adaptive concentration bounds,
we first need to disambiguate the hierarchy of concepts used to build forest
predictors: a forest is an ensemble of trees, each of which relies on a partition
of the data generated by a splitting rule. We begin by providing formal
definitions of these quantities below; we state our main result in Section 2.2.
Throughout our analysis, we assume that we have a set of n independent
and identically distributed training examples $(X_i, Y_i)$ satisfying
$X_i \in [0, 1]^d$ and $Y_i \in [-M, M]$.

2.1. A Review of Recursive Partitioning. The first concept underlying
a regression tree is the splitting rule itself, which induces a partition
$\Lambda$ of $[0, 1]^d$ into non-overlapping rectangles $L_1, \ldots, L_m$.
We use the short-hand $L(x)$ to denote the unique element of $\Lambda$
containing $x$. We are interested in those partitions that can be obtained
by recursive partitioning of the feature space [10]. Starting from a parent
node $\nu = [0, 1]^d$, recursive partitioning operates by repeatedly
selecting a currently unsplit node $\nu$, a splitting variable
$j \in \{1, \ldots, d\}$ and a threshold $\tau \in [0, 1]$, and then
splitting $\nu$ into two children $\nu_- = \nu \cap \{x : x_j \le \tau\}$
and $\nu_+ = \nu \cap \{x : x_j > \tau\}$. The final leaf nodes generated by
this algorithm, denoted by $L$, form a partition $\Lambda$ of $[0, 1]^d$.
Given our training set $\{(X_i, Y_i)\}$, we require the partition to be
valid in the sense of Definition 1.

Definition 1 (Valid partition). A partition $\Lambda$ is
$\{\alpha, k\}$-valid if it can be generated by a recursive partitioning
scheme in which each child node contains at least a fraction $\alpha$ of the
data points in its parent node for some $0 < \alpha < 0.5$, and each
terminal node contains at least k training examples for some
$k \in \mathbb{N}$. Given a dataset $\mathcal{X}$, we denote the set of
$\{\alpha, k\}$-valid partitions by $\mathcal{V}_{\alpha,k}(\mathcal{X})$.


The constraint that each terminal node must have at least k observations
is implemented by default in, e.g., randomForest. Meanwhile, the
requirement that each child node must incorporate at least a fraction
$\alpha$ of the data in its parent (and thus that the tree cannot be
excessively imbalanced) is more substantive. It is known that CART-like
rules tend to split near the edges of noise features [10, 23]; thus our
$\alpha$-constraint may make splits along noise features less desirable. A
similar assumption is also used by, e.g., [31].

A partition $\Lambda$ can then be used to induce a tree predictor by averaging
the responses $Y_i$ over its leaves $L_j$. In our adaptive concentration analysis,
we consider two kinds of trees: valid trees (1) that are fit to the training
sample, and partition-optimal trees (2) that would arise if we could train a
tree supported on the partition $\Lambda$ using the full population.

Definition 2 (Valid and partition-optimal trees). A valid partition
induces a valid tree

$$T_\Lambda : [0, 1]^d \to \mathbb{R}, \quad T_\Lambda(x) = \frac{1}{|\{i : X_i \in L(x)\}|} \sum_{\{i : X_i \in L(x)\}} Y_i. \tag{1}$$

We denote the set of all $\{\alpha, k\}$-valid trees $T_\Lambda$ with
$\Lambda \in \mathcal{V}_{\alpha,k}(\mathcal{X})$ by
$\mathcal{T}_{\alpha,k}(\mathcal{X})$. Given a partition $\Lambda$, we also
define the partition-optimal tree as

$$T^*_\Lambda : [0, 1]^d \to \mathbb{R}, \quad T^*_\Lambda(x) = \mathbb{E}[Y \mid X \in L(x)], \tag{2}$$

where $(X, Y)$ is a new random sample from our data-generating distribution.
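
The distinction between the fitted tree (1) and the partition-optimal tree (2) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: a one-dimensional feature X ~ Uniform[0, 1], a conditional mean E[Y | X = x] = 2x - 1 with bounded uniform noise, and a fixed (non-adaptive) partition given by three thresholds. The helper names are hypothetical.

```python
import bisect
import random


def leaf_index(thresholds, x):
    """Index of the 1-D leaf containing x; points with x <= tau go left."""
    return bisect.bisect_left(thresholds, x)


def fitted_tree(thresholds, xs, ys):
    """Sample average of Y within each leaf, mirroring (1)."""
    sums = [0.0] * (len(thresholds) + 1)
    counts = [0] * (len(thresholds) + 1)
    for x, y in zip(xs, ys):
        j = leaf_index(thresholds, x)
        sums[j] += y
        counts[j] += 1
    return [s / c if c > 0 else float("nan") for s, c in zip(sums, counts)]


def optimal_tree(thresholds):
    """Population average E[Y | X in leaf], mirroring (2).

    Assumes the toy model X ~ Uniform[0, 1] and E[Y | X = x] = 2x - 1, for
    which the average over a leaf [a, b] equals (a + b) - 1.
    """
    edges = [0.0] + list(thresholds) + [1.0]
    return [(a + b) - 1.0 for a, b in zip(edges[:-1], edges[1:])]


def simulate(n, seed=0):
    """Draw n examples from the toy model with noise in [-0.5, 0.5]."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [2.0 * x - 1.0 + rng.uniform(-0.5, 0.5) for x in xs]
    return xs, ys


thresholds = [0.25, 0.5, 0.75]
xs, ys = simulate(4000)
fitted = fitted_tree(thresholds, xs, ys)
optimal = optimal_tree(thresholds)
# Largest leaf-wise gap between the fitted and partition-optimal trees.
max_gap = max(abs(f - o) for f, o in zip(fitted, optimal))
print(fitted, optimal, max_gap)
```

Because the thresholds here are fixed in advance, this sketch only shows the non-adaptive case; the point of adaptive concentration is that a comparable gap can be controlled even when the thresholds are chosen after looking at the responses.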

Forests are, as their name suggests, ensembles of regression trees.
Generating a regression forest involves growing multiple trees; then, the
forest prediction is the average of all the tree predictions. In general,
the choice of splitting variables j is randomized to ensure that the
different trees comprising the forest are not too correlated with each
other. As shown by Breiman [8], the variance reduction of a forest in
comparison with its constituent trees improves as the correlation between
individual trees decreases.

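Definition 3 (Valid and partition-optimal forests). For any
$B \in \mathbb{N}$, let
$T_{\Lambda^{(1)}}, \ldots, T_{\Lambda^{(B)}} \in \mathcal{T}_{\alpha,k}(\mathcal{X})$.
Then, the average

$$H_{\{\Lambda\}_1^B} : [0, 1]^d \to \mathbb{R}, \quad H_{\{\Lambda\}_1^B}(x) = \frac{1}{B} \sum_{b=1}^{B} T_{\Lambda^{(b)}}(x) \tag{3}$$

is a valid forest; we denote the set of $\{\alpha, k\}$-valid forests by
$\mathcal{H}_{\alpha,k}(\mathcal{X})$. The corresponding partition-optimal
forest is defined as

$$H^*_{\{\Lambda\}_1^B} : [0, 1]^d \to \mathbb{R}, \quad H^*_{\{\Lambda\}_1^B}(x) = \frac{1}{B} \sum_{b=1}^{B} T^*_{\Lambda^{(b)}}(x). \tag{4}$$

When there is no risk of ambiguity, we write $H := H_{\{\Lambda\}_1^B}$ and
$H^* := H^*_{\{\Lambda\}_1^B}$.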

There are many proposals for how to choose the splitting variables j and
the thresholds $\tau$ for trees. Our theoretical results, however, do not depend
on the specific splitting rules used, and only rely on the generic structure of
recursive partitioning; thus, we will not focus on specific splitting rules in
this paper. For a review of how trees and forests are implemented in practice,
we recommend Hastie et al. [21] (see Chapters 9.2 and 15).

Remark: Bootstrapping. One way in which our forests from Definition 3
differ from the original proposal of Breiman [8] is that we do not allow the
individual trees to be evaluated on bootstrap samples.^1 It seems plausible,
however, that all our results should still hold even if we allow for
bootstrapping, since the bootstrap is thought to have a regularizing effect
on the forest and should thus reduce its ability to overfit the training
data [7, 11]. Studying the effect of the bootstrap on our adaptive
concentration bounds and perhaps showing how it can improve adaptive
concentration guarantees presents a promising avenue for further work.

^1 Technically, we could use a bootstrap sample to pick the partition
$\Lambda$, but would then need to use the whole training set to turn
$\Lambda$ into a tree $T_\Lambda$.

2.2. An Adaptive Concentration Bound. We are now ready to state our
main result, given some assumptions on the problem setting. Our first
condition is a bound on the dependence of the individual coordinates
$(X_i)_j$ for $j = 1, \ldots, d$.


Assumption 1 (Weakly dependent features). We have n independent
and identically distributed training examples, whose features
$X \in [0, 1]^d$ are distributed according to a density $f(\cdot)$
satisfying $\zeta^{-1} \le f(x) \le \zeta$ for all $x \in [0, 1]^d$, and
some constant $\zeta \ge 1$.

The above assumption is quite general. In contrast, consider the following
condition: We have features $X_i \in \mathbb{R}^d$ that admit a density
satisfying

$$\zeta^{-1} \prod_{j=1}^{d} f_j(x_j) \le f(x) \le \zeta \prod_{j=1}^{d} f_j(x_j). \tag{5}$$

Although the above may at first glance appear weaker than Assumption 1,
it is actually slightly more restrictive. Because trees are invariant to
monotone transformations of the features $(X_i)_j$, we can without loss of
generality rescale the features such that $X_i \in [0, 1]^d$ with uniform
marginals. Applying this transformation to (5) would yield Assumption 1,
along with a uniformity constraint on the marginals of $f(\cdot)$.

Next, although we allow for the minimum leaf size k to be quite small,
we still need for it to grow with n. Note that since k < n, the assumption
below implicitly requires that $\log(d) \ll n/\log(n)$.


Assumption 2 (Minimum leaf size). The minimum leaf-size k grows
with n at a rate bounded from below by

$$\lim_{n \to \infty} \frac{\log(n) \max\{\log(d), \log\log(n)\}}{k} = 0. \tag{6}$$


The following theorem is our main result on the adaptive concentration
of decision trees. This result implies that, in practical data analysis, we can
treat the fitted prediction function $T_\Lambda$ as a good approximation to
the optimal tree $T^*_\Lambda$ supported on the partition $\Lambda$. This
result requires no continuity assumptions on the conditional mean function
$\mathbb{E}[Y \mid X = x]$.

Theorem 1. Suppose that we have n training examples
$(X_i, Y_i) \in [0, 1]^d \times [-M, M]$ satisfying Assumption 1, and that
we have a sequence of problems with parameters (n, d, k) satisfying
Assumption 2. Then, sample averages over all possible valid partitions
concentrate around their expectations with high probability:

$$\lim_{n, d, k \to \infty} \mathbb{P}\left[ \sup_{x \in [0,1]^d,\ \Lambda \in \mathcal{V}_{\alpha,k}} \left|T_\Lambda(x) - T^*_\Lambda(x)\right| \le 9M \sqrt{\frac{\log(n/k)\left(\log(dk) + 3\log\log(n)\right)}{\log\left((1-\alpha)^{-1}\right)}} \, \frac{1}{\sqrt{k}} \right] = 1. \tag{7}$$


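In a moderately high-dimensional regime with $\liminf d/n > 0$, the above
bound simplifies to

$$\lim_{n, d, k \to \infty} \mathbb{P}\left[ \sup_{x \in [0,1]^d,\ \Lambda \in \mathcal{V}_{\alpha,k}} \left|T_\Lambda(x) - T^*_\Lambda(x)\right| \le 9M \sqrt{\frac{\log(n)\log(d)}{\log\left((1-\alpha)^{-1}\right)}} \, \frac{1}{\sqrt{k}} \right] = 1. \tag{8}$$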

This result appears to be remarkably strong. As a baseline, suppose we
just selected a single tree $T \in \mathcal{T}_{\alpha,k}(\mathcal{X})$
non-adaptively, i.e., without looking at the labels $Y_i$. Then, a simple
Hoeffding bound where we take n as a crude upper bound for the total number
of leaves shows that

$$\lim_{n, d, k \to \infty} \mathbb{P}\left[ \sup_{x \in [0,1]^d} \left|T(x) - T^*(x)\right| \le M \sqrt{\frac{2.1 \log(n)}{k}} \right] = 1. \tag{9}$$

Moreover, assuming that k is not too large, say $k < \sqrt{n}$, and that the
tree is fully grown to depth k, the bound (9) is essentially tight.

Comparing (8) with (9), we see that a uniform concentration bound valid
for all possible trees in $\mathcal{T}_{\alpha,k}(\mathcal{X})$ is only a
factor $O(\sqrt{\log(d)})$ weaker than the best concentration bound we could
hope for with a single tree. In other words, the “cost” of adaptively
searching over all valid trees in high dimensions is surprisingly low, and
only scales with $\sqrt{\log(d)}$.

In Section 5, we provide a rate-matching lower bound to Theorem 1 that
holds for $d = n^r$ for some r > 1. Thus, it is not possible to substantially
improve on the above result without restricting the class of considered trees.
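
The arithmetic behind the Hoeffding baseline, and its comparison with the adaptive bound, can be checked with a short sketch. It assumes the readings $9M\sqrt{\log(n)\log(d)/(k\log((1-\alpha)^{-1}))}$ for the right-hand side of (8) and $M\sqrt{2.1\log(n)/k}$ for the right-hand side of (9), and it uses the two-sided Hoeffding tail $2\exp(-kt^2/(2M^2))$ for an average of k responses in [-M, M]. The function names are hypothetical.

```python
import math


def adaptive_bound(n, d, k, alpha, M=1.0):
    """Right-hand side of (8), as read in this sketch."""
    return 9.0 * M * math.sqrt(
        math.log(n) * math.log(d) / (k * math.log(1.0 / (1.0 - alpha)))
    )


def single_tree_bound(n, k, M=1.0):
    """Right-hand side of (9), as read in this sketch."""
    return M * math.sqrt(2.1 * math.log(n) / k)


def hoeffding_union_failure(n, k, M=1.0):
    """Union bound over at most n leaves of the two-sided Hoeffding tail.

    Each leaf average of k responses in [-M, M] deviates from its mean by at
    least t with probability at most 2 exp(-k t^2 / (2 M^2)). Plugging in
    t = single_tree_bound(n, k, M) gives 2 n^(-1.05) per leaf, hence
    2 n^(-0.05) after the union bound, which vanishes (slowly) as n grows.
    """
    t = single_tree_bound(n, k, M)
    return n * 2.0 * math.exp(-k * t * t / (2.0 * M * M))


n, d, k, alpha = 10**5, 10**5, 2000, 0.2
print(single_tree_bound(n, k), adaptive_bound(n, d, k, alpha))
print(hoeffding_union_failure(n, k))
```

The constant 2.1 is just large enough for the union bound to go to zero: the per-leaf failure probability scales as $n^{-1.05}$, and multiplying by the at most n leaves leaves a factor $n^{-0.05}$. The ratio of the two bounds does not depend on n or k and grows like $\sqrt{\log(d)}$.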
Finally, because Theorem 1 holds simultaneously for all trees
$T \in \mathcal{T}$, we note that (7) and (8) also induce adaptive
concentration bounds for random forests. The following result is
conceptually related to the work of Biau et al. [6], who established
conditions under which ensembles inherit consistency properties of their
base classifiers, despite potentially allowing for greater
representational power.

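Corollary 2. Under the conditions of Theorem 1, and assuming that
$\liminf d/n > 0$, we find that

$$\lim_{n, d, k \to \infty} \mathbb{P}\left[ \sup_{x \in [0,1]^d} \left|H(x) - H^*(x)\right| \le 9M \sqrt{\frac{\log(n)\log(d)}{\log\left((1-\alpha)^{-1}\right)}} \, \frac{1}{\sqrt{k}} \right] = 1, \tag{10}$$

simultaneously for all valid forests H constructed as in Definition 3.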


Several authors have studied the theoretical properties of trees and forests 
in order to explain their performance [1, 5, 6, 9, 14, 19, 28, 31, 33, 37, 38, 39, 
45]. In particular, Scornet et al. [38] use low-dimensional asymptotics to prove
that Breiman’s original forests are consistent in terms of their predictive risk, 
while Biau [5] and Denil et al. [14] discuss the properties of some random 
forests where trees are grown using only a holdout set—i.e., without looking 
at the training data used for predictions. To our knowledge, however, our 
above result is the first convergence guarantee for the predictive surface of 
tree-based regression that holds in an asymptotic regime where n and d go 
to infinity together, and does not require the use of a holdout set. 


2.3. Adaptive Concentration as Post-Selection Inference. We can also
interpret Theorem 1 directly as a result about statistical inference for
adaptive regression. To motivate this view, suppose that we build a
tree-based model to do subgroup analysis on data from a clinical trial with
the intent of, e.g., prioritizing treatment to patients in leaves with the
largest estimated treatment effect [2, 41]. In this case, it is crucial to
have valid post-selection confidence intervals for the treatment effect
within each leaf that account for the adaptive model search done by the
tree.^2

At a high level, the problem of post-selection inference for trees is a special
case of post-selection inference for adaptive least-squares regression. In the
classical adaptive regression setup, we start with a design
$X \in \mathbb{R}^{n \times d}$ and a response $Y \in \mathbb{R}^n$; the
statistician then selects a model $\mathcal{M} \subseteq \{1, \ldots, d\}$
and provides estimates of the form $\hat{y} = A \hat\beta_A$, where
$A := X_{\mathcal{M}} \in \mathbb{R}^{n \times |\mathcal{M}|}$ is the matrix
comprised of the columns of $X$ contained in the set $\mathcal{M}$ [e.g.,
4, 18, 26].

^2 Given an unconfoundedness assumption [36], results about regression trees
and forests can directly be adapted to the problem of heterogeneous
treatment effect estimation; see Section 4 of Wager and Athey [45] for an
example.

In contrast, as in Definition 2, a valid tree first generates a partition
$\Lambda = \{L_1, \ldots, L_m\}$ and then averages the responses $Y_i$
inside each leaf $L_j$. Formally, this procedure is equivalent to creating a
design matrix

$$A \in \{0, 1\}^{n \times m}, \quad \text{where } A_{ij} = 1\left(\{X_i \in L_j\}\right),$$

and then running linear regression on A and the response vector Y. In other
words, regression trees are a form of adaptive regression where instead of
subsetting a list of available variables, we build a design matrix by
considering indicator functions over leaves.
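
The equivalence between leaf averaging and least squares on the leaf-indicator design can be made concrete with a small sketch. The data and helper names below are hypothetical, and the sketch assumes every leaf contains at least one training example, so that the normal equations are solvable.

```python
def leaf_design(leaf_ids, m):
    """n x m indicator matrix with A[i][j] = 1 if example i falls in leaf j."""
    return [[1.0 if leaf == j else 0.0 for j in range(m)] for leaf in leaf_ids]


def ols_on_leaf_design(A, y):
    """Solve the normal equations (A^T A) beta = A^T y.

    Each row of A has exactly one nonzero entry, so A^T A is diagonal with
    the leaf counts on its diagonal and the solve reduces to a division.
    Assumes every leaf is non-empty.
    """
    m = len(A[0])
    gram_diag = [sum(row[j] * row[j] for row in A) for j in range(m)]
    aty = [sum(row[j] * yi for row, yi in zip(A, y)) for j in range(m)]
    return [b / g for b, g in zip(aty, gram_diag)]


# Six examples assigned to three leaves (illustrative values only).
leaf_ids = [0, 0, 1, 2, 2, 2]
y = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
A = leaf_design(leaf_ids, 3)
beta_hat = ols_on_leaf_design(A, y)
print(beta_hat)  # the least-squares coefficients are the within-leaf means
```

Since the columns of A are orthogonal indicators, the least-squares coefficient for leaf j is exactly the average response among training examples in that leaf, which is the fitted value of the valid tree on that leaf.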

For any such adaptively-constructed design A, we can define the optimal
regression vector

$$\beta^*_A = (A^\top A)^{-1} A^\top \mu^*, \quad \text{where } \mu^*_i = \mathbb{E}[Y_i \mid A_i];$$

it is then natural to ask about the discrepancy $\hat\beta_A - \beta^*_A$,
where $\hat\beta_A = (A^\top A)^{-1} A^\top Y$ is the ordinary least squares
estimate in the adaptive regression model. In the case of subgroup analysis
with trees, $\hat\beta_A$ corresponds to the measured mean responses in each
tree leaf, whereas $\beta^*_A$ encodes the population averages over the same
leaves.

The problem of bounding the gap $\hat\beta_A - \beta^*_A$ for sparse linear
regression, i.e., where $A = X_{\mathcal{M}}$ for some feature subset
$\mathcal{M}$, has been studied in detail by Berk et al. [4]. The authors
show that it is possible to honestly account for arbitrary model selection
by inflating standard confidence intervals for linear regression by a PoSI
constant K that depends on X. For orthogonal designs, the optimal PoSI
constant scales as $K \sim \sqrt{2 \log d}$; however, there also exist
designs for which $K \ge \sqrt{d}/2$. Moreover, computing the PoSI constant
is in general difficult; the method used by Berk et al. [4] only works for
d < 20.

In contrast, our Theorem 1 shows that, for adaptively grown regression
trees, $\|\hat\beta_A - \beta^*_A\|_\infty = O_P\left(\sqrt{\log(n)\log(d)/k}\right)$;
in other words, there exists a universal PoSI constant for trees on the
order of $\sqrt{\log(n)\log(d)}$. This is somewhat worse than the PoSI
correction for linear regression with an orthogonal design, but is
exponentially better than general-case performance for sparse linear
regression.
2.4. Adaptive Concentration vs. Generalization Bounds. There is a long
tradition of studying the convergence of regression trees using arguments in
the style of Vapnik and Chervonenkis [43], including the work of Breiman
et al. [10] and, more recently, [3, 15, 29, 30, 32]. The goal of these papers 
is to control the generalization error of regression trees, i.e., the difference 
between the training error and the expected test error of the tree. 

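In our context, we can use [30] to obtain the following generalization
bound:

$$\sup_{T \in \mathcal{T}_{\alpha,k}} \left| \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - T(X_i)\right)^2 - \mathbb{E}\left[\left(Y - T(X)\right)^2\right] \right| = O_P\left( \sqrt{\frac{\#\{\text{leaves}\} \left(\log(d) + \log(n)\right)}{n}} \right) = O_P\left( \sqrt{\frac{\log(d) + \log(n)}{k}} \right). \tag{11}$$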

Strikingly, the rate of convergence in (11) is better than the adaptive
concentration rate obtained in Theorem 1. Now, given our matching lower bound
(see Theorem 15), we know that our rates are optimal in moderately high 
dimensions; thus, good adaptive concentration must be fundamentally more 
difficult than good generalization. 

The reason for this discrepancy is that generalization bounds only seek to
control the global performance of regression trees, whereas adaptive
concentration requires local control of the tree regression surface. To give
a concrete example, suppose that $d = n^r$ for some r > 0. Then, the bound
(11) implies that the empirical risk of a regression tree will be consistent
for the test set error of the same tree provided that $k \gg \log(n)$.
Meanwhile, our adaptive concentration analysis implies that there exist
trees with bad leaves unless $k \gg \log(n)^2$; however, these bad leaves
will be rare enough as to not affect the overall test set error of the tree.

Generalization bounds and adaptive concentration bounds can each be
relevant in different practical applications. If all we care about is the test error
of a tree, then the bound (11) is more useful than Theorem 1. Conversely, 
if we want to use a regression tree to identify outlying sub-populations, 
then we need guarantees as in Theorem 1, whereas the rate (11) might be 
misleading. 


3. Consistency of High-Dimensional Random Forests. As one
practical application of Theorem 1, we obtain consistency guarantees for
random forests in a high-dimensional setting. We work in a regime where
$n, d \to \infty$, but the conditional mean function
$\mathbb{E}[Y \mid X = x]$ only depends on a small number of covariates:


Assumption 3 (Sparse signal). There is a signal set
$\mathcal{Q} \subseteq \{1, \ldots, d\}$ of size $|\mathcal{Q}| \le q$ such
that the set of random variables $\{(X_i)_j : j \notin \mathcal{Q}\}$ is
jointly independent of $Y_i$ and the set
$\{(X_i)_j : j \in \mathcal{Q}\}$.

We study a simple CART-like regression forest that can consistently
estimate sparse signals in high dimensions, described in Procedure 1.
Effectively, the algorithm is an extreme form of the procedure of Breiman
[8], where each splitting variable is chosen uniformly at random from
$\{1, \ldots, d\}$. However, in a break from classical regression trees, our
algorithm uses Theorem 1 to test the significance of a candidate split: if
we have never split on a variable j yet, then our tree will only split along
j if it leads to a large enough improvement in mean-squared error. Once the
tree has split along j once, then this feature is “unlocked” and can be
subsequently used without testing.^3 At a high level, our construction
relies on a guarantee from Theorem 1 that no noise feature j will ever
appear significant enough to get unlocked at any stage of the
forest-generation process.

To guarantee consistency of guess-and-check forests, we still need two
additional conditions to hold. Assumption 4 is a weaker version of an
assumption made by Scornet et al. [38] to ensure that trees can eventually
accumulate evidence enabling them to split along signal coordinates. The
continuity assumption is standard for consistency proofs [e.g., 5, 6, 31, 38].


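Assumption 4 (Monotone signal). There is a minimum effect size
$\beta > 0$ and a set of sign variables $\sigma_j \in \{\pm 1\}$ such that,
for all $j \in \mathcal{Q}$ and all $x \in [0, 1]^d$,

$$\sigma_j \left( \mathbb{E}\left[ Y_i \,\middle|\, (X_i)_{-j} = x_{-j},\ (X_i)_j > \tfrac{1}{2} \right] - \mathbb{E}\left[ Y_i \,\middle|\, (X_i)_{-j} = x_{-j},\ (X_i)_j \le \tfrac{1}{2} \right] \right) \ge \beta,$$

where $x_{-j} \in [0, 1]^{d-1}$ denotes the vector containing all but the
j-th coordinate of x.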


Assumption 5 (Continuity). The function $\mathbb{E}[Y \mid X = x]$ is
Lipschitz-continuous in x.

^3 Running a hypothesis test before accepting a split for a regression tree
has also been considered by Zeileis et al. [47]; however, the paper does not
provide any formal consistency guarantees.
Procedure 1. Guess-and-Check Forest

Input: n training examples of the form $(X_i, Y_i)$, a minimum leaf size k,
and a balance parameter $\alpha$.

Guess-and-check trees recursively apply the following splitting procedure until no
more splits are possible, i.e., until all terminal nodes contain less than 2k training
examples or there were no possible splits satisfying (12) to begin with.

1. Select a currently unsplit node $\nu$ containing at least 2k training examples.

2. Pick a candidate splitting variable $j \in \{1, \ldots, d\}$ uniformly at random.

3. Pick the minimum squared error splitting point $\hat\theta$. More specifically,

   $$\hat\theta = \operatorname{argmax}_\theta\ \ell(\theta) := \frac{4 N^-(\theta) N^+(\theta)}{\left(N^-(\theta) + N^+(\theta)\right)^2} \, \Delta^2(\theta)$$

   such that $\theta = (X_i)_j$ for some $X_i \in \nu$ and
   $\alpha \, |\{i : X_i \in \nu\}|,\ k \le N^-(\theta), N^+(\theta)$, where

   $$\Delta(\theta) = \sum_{\{i : X_i \in \nu,\ (X_i)_j > \theta\}} Y_i \Big/ N^+(\theta) \; - \sum_{\{i : X_i \in \nu,\ (X_i)_j \le \theta\}} Y_i \Big/ N^-(\theta),$$

   $$N^-(\theta) = |\{i : X_i \in \nu,\ (X_i)_j \le \theta\}|, \qquad N^+(\theta) = |\{i : X_i \in \nu,\ (X_i)_j > \theta\}|.$$

4. If either there has already been a successful split along variable j for some
   other node or

   $$\ell(\hat\theta) \ge \left( 2 \times 9M \sqrt{\frac{\log(n)\log(d)}{k \log\left((1-\alpha)^{-1}\right)}} \right)^2, \tag{12}$$

   the split succeeds and we cut the node $\nu$ at $\hat\theta$ along the j-th variable; if not,
   we do not split the node $\nu$ this time.

A guess-and-check forest is the average of B independently generated guess-and-check
trees.
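
Steps 3 and 4 of Procedure 1 can be sketched for a single node as follows. This is a minimal illustration rather than a full forest implementation: it handles one node and one candidate variable, the helper names are hypothetical, and the threshold in `unlock_threshold` assumes that the right-hand side of (12) is twice the bound of Theorem 1, squared (the leading factor 2 × 9M is an assumption of this sketch).

```python
import math


def best_split(xj, y, alpha, k):
    """Step 3: scan thresholds theta among the observed values of feature j.

    xj and y hold the j-th feature and the response for the examples in the
    node. Returns (theta_hat, ell_hat), or None if no candidate leaves at
    least max(alpha * N, k) examples on each side.
    """
    order = sorted(range(len(xj)), key=lambda i: xj[i])
    n_node = len(order)
    min_child = max(alpha * n_node, k)
    total = sum(y)
    best = None
    left_sum = 0.0
    for pos, i in enumerate(order[:-1]):
        left_sum += y[i]
        # Only cut between distinct values so the left child is {x_j <= theta}.
        if xj[i] == xj[order[pos + 1]]:
            continue
        n_minus = pos + 1
        n_plus = n_node - n_minus
        if n_minus < min_child or n_plus < min_child:
            continue
        delta = (total - left_sum) / n_plus - left_sum / n_minus
        ell = 4.0 * n_minus * n_plus / n_node ** 2 * delta ** 2
        if best is None or ell > best[1]:
            best = (xj[i], ell)
    return best


def unlock_threshold(n, d, k, alpha, M):
    """Right-hand side of (12) as read in this sketch (factor 2 * 9 M assumed)."""
    rate = math.log(n) * math.log(d) / (k * math.log(1.0 / (1.0 - alpha)))
    return (2.0 * 9.0 * M * math.sqrt(rate)) ** 2


def try_split(xj, y, j, unlocked, n, d, k, alpha, M):
    """Step 4: accept the split if j is already unlocked or ell clears (12).

    Returns theta_hat on success (and records j as unlocked), else None.
    """
    cand = best_split(xj, y, alpha, k)
    if cand is None:
        return None
    theta_hat, ell_hat = cand
    if j in unlocked or ell_hat >= unlock_threshold(n, d, k, alpha, M):
        unlocked.add(j)
        return theta_hat
    return None
```

A full guess-and-check tree would call `try_split` repeatedly on unsplit nodes with a uniformly drawn j, sharing the `unlocked` set across nodes. For small n the threshold is very conservative, which matches the asymptotic nature of the guarantee: a variable is only unlocked once the improvement exceeds what noise alone could plausibly produce.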


Given this setup, our first result is a uniform consistency result for
guess-and-check forests in high dimensions; recall that, as in Theorem 1, we
assume that the minimum leaf size k grows faster than $\log(n)\log(d)$. We
take the signal dimension q to be fixed.

Theorem 3. Under the conditions of Theorem 1 with $\liminf d/n > 0$,
suppose that $\hat{y}(x)$ are estimates for $\mathbb{E}[Y \mid X = x]$
obtained using a guess-and-check forest (Procedure 1). Suppose, moreover,
that Assumptions 3, 4, and 5 hold. Then,

$$\lim_{n \to \infty} \sup_{x \in [0,1]^d} \left| \hat{y}(x) - \mathbb{E}[Y \mid X = x] \right| = 0.$$
To our knowledge, this is the first pointwise consistency result for
adaptively grown random forests in high dimensions. Scornet et al. [38]
prove that random forests are consistent; however, their proof only works in
a regime where d is fixed while $n \to \infty$. Meanwhile, in high
dimensions, Biau [5] establishes a rate of convergence for random forests
that only depends on the signal dimension q and not on d. However, his proof
assumes a form of data splitting, where we first effectively do variable
selection on a holdout sample, and then grow non-adaptive trees on the
training sample without looking at the responses Y.

We also consider the predictive error of our guess-and-check forests. To
do so, we focus on a special variant of guess-and-check trees with
“$\alpha = 0.5$”, meaning that $\hat\theta$ in step 3 of Procedure 1 is
always set to the median of the $(X_i)_j$ with $X_i \in \nu$. The idea of
building trees by splitting along leaf medians
goes back to at least Devroye et al. [15], and the behavior of median forests 
in low dimensions has been studied in detail by Duroux and Scornet [17]. 

The following result hinges on showing that our median guess-and-check
forest, with examples in $[0, 1]^d$, converges as fast as a standard median
forest with examples in $[0, 1]^q$. We can thus recover the rates of
convergence of Duroux and Scornet [17] that only depend on q. In order to
apply the result of Duroux and Scornet [17], we need to require the $X_i$ to
be uniformly distributed over $[0, 1]^d$, i.e., $\zeta = 1$ in Assumption 1.


Theorem 4. Set $\alpha = 0.5$, and define $\rho = 1/(1 - 3/(4q))$. Suppose
that the conditions of Theorem 1 with $\liminf d/n > 0$ and Assumptions 3, 4
and 5 hold, and moreover that $X_i$ is uniformly distributed over
$[0, 1]^d$. Then, the excess error rate of the guess-and-check tree
(Procedure 1) satisfies:

$$\mathbb{E}\left[ \left( \hat{y}(X) - \mathbb{E}[Y \mid X] \right)^2 \right] = O\left( n^{-\frac{\log(\rho)}{\log(2\rho)}} \right), \tag{13}$$

given $k \asymp n^{\log(\rho)/\log(2\rho)}$.


The result above is a direct analogue to the result of Biau [5], except that
our trees are adaptive, i.e., grown as usual by considering the Y sample,
whereas his result required a holdout sample for variable selection. As noted
by Biau, as soon as $q < 0.54\,d$, the rate (13) is better than the standard rate
for non-parametric estimation of Lipschitz functions, i.e., $n^{-2/(d+2)}$.
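
The comparison between the two rates can be checked numerically. The sketch below assumes that the exponent in (13) reads $\log(\rho)/\log(2\rho)$ with $\rho = 1/(1 - 3/(4q))$, and that the standard Lipschitz rate is $n^{-2/(d+2)}$; the function names are hypothetical.

```python
import math


def median_forest_exponent(q):
    """Exponent log(rho) / log(2 rho) with rho = 1 / (1 - 3 / (4 q)), q >= 1."""
    rho = 1.0 / (1.0 - 3.0 / (4.0 * q))
    return math.log(rho) / math.log(2.0 * rho)


def lipschitz_exponent(d):
    """Exponent 2 / (d + 2) of the standard non-parametric rate."""
    return 2.0 / (d + 2.0)


d = 100
for q in (10, 50, 54, 55, 60):
    # A larger exponent means a faster rate of convergence.
    faster = median_forest_exponent(q) > lipschitz_exponent(d)
    print(q, median_forest_exponent(q), lipschitz_exponent(d), faster)
```

For large q, $\log(\rho) \approx 3/(4q)$, so the exponent is close to $0.75/(q\log 2 + 0.75)$; setting this equal to $2/(d+2)$ gives the crossover $q \approx 0.75\,d/(2\log 2) \approx 0.54\,d$.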

Finally, we note that Assumption 4 was only used as a crude tool to 
ensure that the guess-and-check forest succeeds at splitting on all the signal 
variables; given Assumptions 3 and 5, our proof shows that guess-and-check 
forests will be consistent as long as the trees in fact split on the signal 
variables. Empirically, greedily trained regression trees have been found to 
be powerful under substantially weaker conditions than Assumption 4, so we 
expect a relaxation to be possible; however, we leave this to further work. 
3.1. Proof Sketch. In this section, we briefly outline the main steps in
proving our forest consistency results using Theorem 1. We begin by showing
that, as $n, d \to \infty$, guess-and-check trees (Procedure 1) never split on noise
variables. We emphasize that the following result holds simultaneously for
all random realizations of the guess-and-check forest.

Lemma 5. Under the conditions of Theorem 1 with $\liminf d/n > 0$,
suppose moreover that Assumption 3 holds. Let $\pi_{\mathrm{bad}}$ be the
probability that any guess-and-check tree ever splits on a noise variable
$j \notin \mathcal{Q}$. Then, $\pi_{\mathrm{bad}} = O(1/\sqrt{n})$.

In order to achieve consistency, we also need to make enough splits along 
the signal variables. The following lemma guarantees that, given Assumption 
4, a guess-and-check tree will almost always split on a signal variable when 
it gets a chance to do so. 

Lemma 6. Suppose Assumptions 3 and 4 hold. For any variable
$j \in \mathcal{Q}$, let $\pi_j$ be the probability that the first time any
guess-and-check tree tries to split along j, the split succeeds. Then
$\pi_j = 1 - O(1/\sqrt{n})$.

Given these two lemmas, we see that a d-dimensional guess-and-check
tree with a q-dimensional signal effectively behaves like a q-dimensional tree.
Then, to prove Theorems 3 and 4, it suffices to use existing results about
the consistency of random forests in low dimensions; we in particular build
on the work of Duroux and Scornet [17] and Meinshausen [31].

4. Theoretical Development. We now develop the technical machinery
required to prove Theorem 1. The bulk of our work goes into bounding
large deviations of the process

$$\frac{1}{\#L} \sum_{\{i : X_i \in L\}} Y_i - \mathbb{E}[Y \mid X \in L], \tag{14}$$

where L ranges over the set $\mathcal{L}_{\alpha,k}$ of all possible leaves
of a valid partition $\Lambda \in \mathcal{V}_{\alpha,k}$. Our argument
proceeds as follows.

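First, in Section 4.2, we construct a parsimonious set of rectangles
$\mathcal{R}$ such that any leaf $L \in \mathcal{L}_{\alpha,k}$ can be
well-approximated by a rectangle $R \in \mathcal{R}$ under Lebesgue measure
$\lambda(\cdot)$. More specifically, for any large enough
$L \in \mathcal{L}_{\alpha,k}$, we require that there exist
$R_-, R_+ \in \mathcal{R}$ such that $R_- \subseteq L \subseteq R_+$ and
$e^{-1/\sqrt{k}} \lambda(R_+) \le \lambda(L) \le e^{1/\sqrt{k}} \lambda(R_-)$.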

Then, in Section 4.3, we establish concentration bounds for the process 
(14) that only depend on the cardinality of the approximating set TZ. The 
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main step here is in showing that the rectangles R- and constructed 
above are also good approximations to L under the empirical measure; for 
example, if R- is an inner approximation to L, then R- cannot contain too 
many fewer training examples than L. The technical results from Sections 
4.2 and 4.3 directly yield Theorem 1, as explained in Section 4.4. Proofs are 
given in the appendix. 

4.1. Notation. Throughout our analysis, we assume that we have $n$ labeled
independent and identically distributed training examples $(X_i, Y_i) \in
[0, 1]^d \times \mathbb{R}$. We denote rectangles $R \subseteq [0, 1]^d$ by

(15) $R = \bigotimes_{j=1}^{d} [r_j^-, r_j^+]$, where $0 \le r_j^- < r_j^+ \le 1$ for all $j = 1, \ldots, d$,

writing the Lebesgue measure of $R$ as

(16) $\lambda(R) = \prod_{j=1}^{d} (r_j^+ - r_j^-)$.

Recall that, by Assumption 1, the features $X_i$ have a distribution $f(\cdot)$
satisfying $\zeta^{-1} \le f(x) \le \zeta$ for all $x \in [0, 1]^d$. Given this setting, we write the
expected fraction of training examples falling inside $R$ as $\mu(R)$, and the
number of training examples $X_i$ inside $R$ as $\#R$:

(17) $\mu(R) = \int_R f(x)\,dx$, $\quad \#R = |\{i : X_i \in R\}|$.

Notice that, marginally, $\#R \sim \text{Binomial}(n, \mu(R))$. For any rectangle $R$, we
define its support as

(18) $S(R) = \{j \in 1, \ldots, d : r_j^- \ne 0 \text{ or } r_j^+ \ne 1\}$;

these are the features used in defining $R$. Finally, we write $\mathcal{L}_{\alpha,k}$ for the set of
all possible leaves associated with a valid partition $\Lambda \in \mathcal{V}_{\alpha,k}$. This notation
is summarized in Table 1.
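As a concrete reading of the notation in (15)-(18), the following minimal Python sketch stores a rectangle as a list of per-coordinate intervals and computes its Lebesgue measure, its point count, and its support. The list-of-pairs representation and the helper names `volume`, `count_points`, and `support` are illustrative assumptions, not part of the paper.

```python
from math import prod

# A rectangle R in [0, 1]^d is stored as a list of (lo, hi) pairs, one per coordinate.
# This representation and the function names are hypothetical, for illustration only.

def volume(rect):
    """Lebesgue measure of R: the product of its side lengths, as in (16)."""
    return prod(hi - lo for lo, hi in rect)

def count_points(rect, xs):
    """#R: the number of points in xs that fall inside R, as in (17)."""
    return sum(
        all(lo <= x <= hi for x, (lo, hi) in zip(point, rect))
        for point in xs
    )

def support(rect):
    """S(R): 1-based indices of the coordinates that R actually restricts, as in (18)."""
    return {j + 1 for j, (lo, hi) in enumerate(rect) if lo != 0.0 or hi != 1.0}

rect = [(0.0, 0.5), (0.0, 1.0), (0.25, 1.0)]
points = [(0.1, 0.9, 0.5), (0.6, 0.2, 0.8), (0.4, 0.4, 0.1)]
print(volume(rect), count_points(rect, points), support(rect))
```

Only the first of the three example points lies inside the rectangle, and the second coordinate is unrestricted, so it does not belong to the support.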

4.2. A Set of Approximating Rectangles. Our first result effectively bounds
the complexity of the space of rectangles over the unit cube by showing how
all such rectangles can be well-approximated using an economical set of
rectangles $\mathcal{R}$. We detail a constructive characterization of $\mathcal{R}$ in Section 4.2.1;
this construction is a generalization of the one used by Walther [46] to study
the asymptotics of the multidimensional scan statistic [24].

Table 1
Summary of notation.

  $n$                         Number of training examples
  $d$                         Dimension of feature space: $X \in [0, 1]^d$
  $k$                         Minimum leaf size
  $\alpha$                    Bound on the allowable imbalance in recursive partitioning
  $\mathcal{V}_{\alpha,k}$    Set of $\{\alpha, k\}$-valid partitions; see Definition 1
  $\mathcal{L}_{\alpha,k}$    Set of all leaves generated by $\{\alpha, k\}$-valid partitions
  $\mathcal{T}_{\alpha,k}$    Set of $\{\alpha, k\}$-valid trees; see Definition 2
  $\mathcal{H}_{\alpha,k}$    Set of $\{\alpha, k\}$-valid forests; see Definition 3
  $\zeta$                     Bound on the dependence structure of $X$; see Assumption 1
  $\lambda(R)$                Lebesgue measure of a rectangle $R$; see (16)
  $\mu(R)$                    Expected fraction of training examples in a rectangle $R$; see (17)
  $\#R$                       Number of training examples inside a rectangle $R$; see (17)
  $S(R)$                      Support of a rectangle $R$; see (18)


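Theorem 7. Let $S \subseteq \{1, \ldots, d\}$ be a set of size $|S| = s$, and let $w, \varepsilon \in
(0, 1)$. Then, there exists a set of rectangles $\mathcal{R}_{S,w,\varepsilon}$ such that the following
properties hold. Any rectangle $R$ with support $S(R) \subseteq S$ and of volume
$\lambda(R) \ge w$ can be well approximated by elements in $\mathcal{R}_{S,w,\varepsilon}$ from both above
and below in terms of Lebesgue measure. Specifically, there exist rectangles
$R_-, R_+ \in \mathcal{R}_{S,w,\varepsilon}$ such that

(19) $R_- \subseteq R \subseteq R_+$, and $e^{-\varepsilon} \lambda(R_+) \le \lambda(R) \le e^{\varepsilon} \lambda(R_-)$.

Moreover, the set $\mathcal{R}_{S,w,\varepsilon}$ has cardinality bounded by

(20) $|\mathcal{R}_{S,w,\varepsilon}| \le \frac{1}{w} \left( \frac{8 s^2}{\varepsilon^2} \left(1 + \log_2 w^{-1}\right) \right)^s \cdot (1 + O(\varepsilon))$.

In order to approximate all possible $s$-sparse rectangles, we use the set

(21) $\mathcal{R}_{s,w,\varepsilon} = \bigcup_{|S| = s} \mathcal{R}_{S,w,\varepsilon}$

of size

(22) $|\mathcal{R}_{s,w,\varepsilon}| \le \binom{d}{s} \frac{1}{w} \left( \frac{8 s^2}{\varepsilon^2} \left(1 + \log_2 w^{-1}\right) \right)^s \cdot (1 + O(\varepsilon))$.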

Now, given our tree construction as encoded in Definition 1, each child node
must be smaller than its parent by at least a factor $1 - \alpha$; thus, for any
$L \in \mathcal{L}_{\alpha,k}$, $\#L \le (1 - \alpha)^{|S(L)|}\, n$, and so

(23) $|S(L)| \le \frac{\log(n / \#L)}{\log(1/(1 - \alpha))} \le \frac{\log(n/k)}{\log(1/(1 - \alpha))}$.

This implies that, by setting

(24) $s = \left\lfloor \frac{\log(n/k)}{\log(1/(1 - \alpha))} \right\rfloor + 1$,

we can use $\mathcal{R}_{s,w,\varepsilon}$ to approximate all possible tree leaves $L \in \mathcal{L}_{\alpha,k}$ to
within error $\varepsilon$ under Lebesgue measure $\lambda(\cdot)$, provided the leaves have
volume $\lambda(L) \ge w$. We end this section with a useful bound on the size of the
approximating set $\mathcal{R}_{s,w,\varepsilon}$.
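The arithmetic behind the sparsity level can be checked directly. The sketch below assumes the bound is read as $|S(L)| \le \log(n/k) / \log(1/(1-\alpha))$ and that $s$ is taken to be the floor of this quantity plus one; the function names and the example values of $n$, $k$, and $\alpha$ are hypothetical.

```python
from math import floor, log

def max_support_size(n, k, alpha):
    """Upper bound on |S(L)|: log(n / k) / log(1 / (1 - alpha))."""
    return log(n / k) / log(1.0 / (1.0 - alpha))

def sparsity_level(n, k, alpha):
    """Assumed choice of s: floor of the bound above, plus one."""
    return floor(max_support_size(n, k, alpha)) + 1

n, k, alpha = 100_000, 100, 0.2
s = sparsity_level(n, k, alpha)
print(max_support_size(n, k, alpha), s)
# After s shrinkages by a factor (1 - alpha), fewer than k points can remain.
print((1.0 - alpha) ** s * n)
```

With these values the bound is just under 31, so $s = 31$: a leaf with at least $k$ points cannot have been cut along 31 or more distinct features.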


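Corollary 8. Suppose that we set

(25) $w = \frac{1}{2\zeta} \frac{k}{n}$, $\quad \varepsilon = \frac{1}{\sqrt{k}}$, and $s = \left\lfloor \frac{\log(n/k)}{\log((1 - \alpha)^{-1})} \right\rfloor + 1$,

where $0 < \alpha < 0.5$ and $\zeta \ge 1$ are fixed constants. Then,

(26) $\log |\mathcal{R}_{s,w,\varepsilon}| \le \frac{\log(n/k) \left(\log(dk) + 3 \log\log(n)\right)}{\log((1 - \alpha)^{-1})} + O(\log(\max\{n, d\}))$.
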

4.2.1. Constructing Approximating Rectangles. Without loss of generality,
we can take $S = \{1, \ldots, s\}$; thus, our job is to $\varepsilon$-approximate all
rectangles $R \subseteq [0, 1]^s$ of volume at least $w$. When $s = 1$, it is easy to verify
that we can construct an approximating set containing on the order of $w^{-2}$
elements that $\varepsilon$-approximate all possible intervals of length greater than
$w$: we can build such a set by, e.g., considering all rectangles of the form
$[a \cdot w\varepsilon/2,\ b \cdot w\varepsilon/2]$ where $a$ and $b$ are integers.
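The one-dimensional case can be sketched in a few lines. The code below snaps an interval outward and inward to the grid of integer multiples of $w\varepsilon/2$; the function name `outer_inner` and the example values are hypothetical. Snapping moves each endpoint by at most one grid step, so an interval of length at least $w$ gains at most $w\varepsilon$ on the outside (a factor of at most $1 + \varepsilon \le e^{\varepsilon}$) and loses at most $w\varepsilon$ on the inside (it keeps at least a fraction $1 - \varepsilon$ of its length).

```python
from math import ceil, exp, floor

def outer_inner(lo, hi, w, eps):
    """Snap [lo, hi] outward and inward to multiples of w * eps / 2, clipping the outer end at 1."""
    step = w * eps / 2.0
    outer = (floor(lo / step) * step, min(1.0, ceil(hi / step) * step))
    inner = (ceil(lo / step) * step, floor(hi / step) * step)
    return outer, inner

w, eps = 0.1, 0.2
lo, hi = 0.123, 0.377                      # an interval of length 0.254 >= w
(out_lo, out_hi), (in_lo, in_hi) = outer_inner(lo, hi, w, eps)
length = hi - lo
print((out_lo, out_hi), (in_lo, in_hi))
print((out_hi - out_lo) / length, exp(eps))   # outer ratio versus e^eps
print((in_hi - in_lo) / length, 1.0 - eps)    # inner ratio versus 1 - eps
```

The grid has roughly $2/(w\varepsilon)$ points, so the number of candidate intervals grows like $w^{-2}$ for fixed $\varepsilon$.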

A naive extrapolation of this idea may suggest that, as $s$ grows, the number
of required rectangles scales as $w^{-2s}$: this is what we would get by varying
all the parameters $r_j^-$ and $r_j^+$ freely. However, as shown by the construction
below, this guess is much too pessimistic. The reason for this is that the
volume constraint $\lambda(R) = \prod_{j=1}^{s} (r_j^+ - r_j^-) \ge w$ becomes more and more
stringent as the dimension $s$ grows, because every dimension along which
$r_j^- \ne 0$ or $r_j^+ \ne 1$ geometrically cuts the size of $\lambda(R)$. For example, we can
immediately verify that if $\lambda(R) \ge w$, then $r_j^+ - r_j^- \le 0.5$ can hold for
at most $\log_2(w^{-1})$ coordinates.
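The counting claim at the end of this paragraph is easy to check numerically: each side of length at most $0.5$ at least halves the volume, so a rectangle of volume at least $w$ can have at most $\lfloor \log_2(w^{-1}) \rfloor$ such sides. The helper names and the example rectangle below are hypothetical.

```python
from math import floor, log2, prod

def max_short_sides(w):
    """Largest number of sides of length <= 0.5 compatible with volume >= w."""
    return floor(log2(1.0 / w))

def short_sides(side_lengths):
    """Number of sides of length <= 0.5."""
    return sum(1 for length in side_lengths if length <= 0.5)

w = 0.01
sides = [0.5] * 6 + [1.0] * 4          # volume 0.5**6 = 0.015625 >= w
print(prod(sides), short_sides(sides), max_short_sides(w))
print(prod([0.5] * 7))                 # a seventh short side pushes the volume below w
```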

The construction below exploits the intuition that at most a few
coordinates can be active on a small scale. Generalizing ideas from [46], we define
$\mathcal{R}$ as the set of all rectangles of the form $R = \bigotimes_{j=1}^{s} [r_j^-, r_j^+]$, with
(27) rJ = — and = min |l, r~ -|- w2'^^ -|- bj2'^^~^ —| , 
such that


(28) aj G 0, 1, .., 

2^-^ — 


WS- 

(29) Tj G 0, 1, ..., 

[log2u;“\ 


bj G 0, 1, 



In this construction, the $j$-th interval $[r_j^-, r_j^+]$ is on the scale $w 2^{\tau_j}$. The
observation that only a few coordinates $j$ can be active on a small scale is
encoded in the lower bound in (29). The lemma below confirms
that this approximating set is valid.


Lemma 9. Given any rectangle $R$ with support $S(R) \subseteq S$ and volume
$\lambda(R) \ge w$, we can select rectangles $R_-$ and $R_+$ satisfying (19) from the
approximating set $\mathcal{R}$ defined above.


To complete our characterization of the approximating set, it suffices to
bound the cardinality of $\mathcal{R}$. This computation is carried out in the Appendix,
in the proof of Theorem 7.


4.3. Uniform Concentration over Rectangles. In the previous section, we
showed how to $\varepsilon$-approximate all possible tree leaves under the Lebesgue
measure on $[0, 1]^d$. However, to understand the behavior of decision trees,
we do not want to approximate tree leaves in terms of Lebesgue measure,
but rather in terms of the empirical measure induced by the training
features which, given our assumptions, are uniformly distributed over
$[0, 1]^d$.

The following result lets us get around this issue by showing that the
empirical measure induced by the training examples is concentrated enough
that, with high probability, the set $\mathcal{R}_{s,w,\varepsilon}$ is also a good approximating set
in terms of the empirical measure induced by the training data.


Theorem 10. Suppose that Assumption 1 holds, and that we have a
sequence of problems indexed by $n$ with values of $d$ and $k$ satisfying
Assumption 2. Let $\mathcal{R}_{s,w,\varepsilon}$ be as defined in (21) with $s$ as in (24), and choose $\varepsilon$ and
$w$ such that

(30) $\varepsilon = \frac{1}{\sqrt{k}}$ and $w = \frac{1}{2\zeta} \frac{k}{n}$,

where $\zeta \ge 1$ is the constant from Assumption 1. Then, there exists an $n_0 \in \mathbb{N}$
such that, for every $n \ge n_0$, the following statement holds with probability
at least $1 - n^{-1/2}$: for every possible leaf $L \in \mathcal{L}_{\alpha,k}$, we can select a rectangle
$R \in \mathcal{R}_{s,w,\varepsilon}$ such that $R \subseteq L$, $\lambda(L) \le e^{\varepsilon} \lambda(R)$, and

(31) $\#L - \#R \le 3\zeta^2 \varepsilon\, \#L + 2\sqrt{3 \log(|\mathcal{R}_{s,w,\varepsilon}|)\, \#L} + O(\log(|\mathcal{R}_{s,w,\varepsilon}|))$.



We can then turn this result into a concentration bound on our empirical
process of interest (14), provided we have a tail bound on the responses $Y_i$. In
the following result, we obtain such a tail bound by imposing a uniform
sub-Gaussianity requirement on the conditional distribution $Y \mid X \sim G(\cdot; X)$.
We define a conditional distribution $G(\cdot; X)$ to be uniformly sub-Gaussian if
there is a constant $M > 0$ for which the following holds. For any $X$-marginal
distribution $X \sim F_X(\cdot)$ supported on $[0, 1]^d$ and $t > 0$, the resulting
$Y$-marginal distribution $F_Y(\cdot) = \int G(\cdot; X)\, dF_X$ satisfies:

(32) $\mathbb{E}_{Y \sim F_Y}\left[e^{t(Y - m)}\right] \le e^{\frac{1}{2} M^2 t^2}$, where $m = \mathbb{E}_{Y \sim F_Y}[Y]$.

Note that if $Y$ is bounded by $M$, i.e., $|Y| \le M$ almost surely, then (32) can
immediately be verified using standard results.

Corollary 11. Suppose that the conditions of Theorem 10 hold, that
the parameters $\varepsilon$ and $w$ are chosen as in (30), and that the conditional
distribution of $Y$ given $X$ is uniformly sub-Gaussian in the sense of (32).
Then, there exists an $n_0 \in \mathbb{N}$ such that, for all $n \ge n_0$, the following holds
with probability at least $1 - 2/\sqrt{n}$:



(33) 


4.3.1. Proof Sketch for Theorem 10. Here, we present a series of technical
results that lead up to Theorem 10. We begin with the following technical
lemma, which follows from the Chernoff-Hoeffding concentration bound. In
practice, we will always use Lemma 12 with $\mathcal{R}$ set to our finite approximating
set $\mathcal{R}_{s,w,\varepsilon}$.


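Lemma 12. Fix a sequence $\delta(n) > 0$, and define the event

(34) $\mathcal{A} : \sup\left\{ \frac{|\#R - n\mu(R)|}{\sqrt{n\mu(R)}} : R \in \mathcal{R},\ \mu(R) \ge \mu_{\min} \right\} \le \sqrt{3 \log\left(\frac{|\mathcal{R}|}{\delta}\right)}$

for any set of rectangles $\mathcal{R}$ and threshold $\mu_{\min}$. Then, for any sequence of
problems indexed by $n$ with

(35) $\lim_{n \to \infty} \frac{\log(|\mathcal{R}|/\delta)}{n \mu_{\min}} = 0$, and $\lim_{n \to \infty} \frac{\delta^{-1}}{|\mathcal{R}|} = 0$,

there is a threshold $n_0$ such that, for all $n \ge n_0$, we have $P[\mathcal{A}] \ge 1 - \delta$. Note
that, above, $\mathcal{A}$, $\mathcal{R}$, $\mu_{\min}$, and $\delta$ are all implicitly changing with $n$.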


For now, the relation (34) is only valid for rectangles $R$ contained in $\mathcal{R}$,
which we will take to be our finite approximating set $\mathcal{R}_{s,w,\varepsilon}$. In general,
however, the leaves $L \in \mathcal{L}_{\alpha,k}$ we want to study will not be in $\mathcal{R}_{s,w,\varepsilon}$. The
following result lets us move beyond this issue by providing a bound that is
valid for all rectangles $R$, not just those in $\mathcal{R}_{s,w,\varepsilon}$.

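Lemma 13. Suppose that the event $\mathcal{A}$ defined in Lemma 12 has occurred
with $\mathcal{R} = \mathcal{R}_{s,w,\varepsilon}$ and $\mu_{\min} = \zeta w$, where $\zeta$ is the constant from Assumption
1. Then, all rectangles $R$ with $\mu(R) \ge \zeta w$ satisfy:

$\#R \le e^{\zeta^2 \varepsilon}\, n\mu(R) + e^{\frac{1}{2}\zeta^2 \varepsilon} \sqrt{3 n \mu(R) \log\left(\frac{|\mathcal{R}_{s,w,\varepsilon}|}{\delta}\right)}$.
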

Finally, in order for the above result to be useful for understanding the
leaves of decision trees, we need to show that all possible leaves $L$ will satisfy
the condition $\mu(L) \ge \zeta w$; the result below gives us such a guarantee. With
these results in hand, proving Theorem 10 reduces to algebra.


Corollary 14. Let $\mathcal{A}$ be the event from Lemma 12 with $\mathcal{R} = \mathcal{R}_{s,w,\varepsilon}$
and $\delta = 1/\sqrt{n}$, and define the parameters $s$, $w$ and $\varepsilon$ as in the statement of
Theorem 10. Then, there exists an $n_0 \in \mathbb{N}$ such that, for $n \ge n_0$,

(36) $\inf\{\mu(L) : L \in \mathcal{L}_{\alpha,k}\} \ge \zeta w$ on the event $\mathcal{A}$.

4.4. Proof of Theorem 1. We have now gathered all the ingredients
required to prove Theorem 1, which follows from combining Corollary 8 with
Corollary 11. In fact, the first bound (7) follows directly from these two
results by noting that $\sup\{|Y_i| : X_i \in L\} \le M$ (since $Y_i$ is bounded by
assumption) and that $2 + 4\sqrt{3} < 9$.

Now, if moreover $\liminf d/n > 0$, then we can verify that $\log(n)\log(d) -
\log(n/k)\log(dk) > 0$ for large enough $n$, and so we can use Corollary 8 to
bound
(37) log(7^s,n,£) < log W ^ log{d)}). 

log(l - a) 
Thanks to our assumption that $\liminf d/n > 0$, the remainder term is
negligible and (8) holds.

5. Lower Bounds. In this final section, we complement our main adaptive
concentration bound, and show that the convergence rate given in (8)
cannot be improved. Lin and Jeon [28] have also studied lower bounds for
forest convergence; however, they only consider non-adaptive forests and so
their lower bounds are substantially weaker. We show the following:


Theorem 15. For any $r > 0$, set $d := d(n) = \lfloor n^r \rfloor$, and let $\alpha \le 0.2$.
Then, there exists a distribution over $(X, Y)$ and a sequence $k(n)$ satisfying
the conditions of Theorem 1 for which

(38) $\lim_{n \to \infty} P\left[ \sup_{x \in [0,1]^d,\ \Lambda \in \mathcal{V}_{\alpha,k}} |T_\Lambda(x) - T^*_\Lambda(x)| \ge \frac{M}{5} \sqrt{\frac{\log(n)\log(d)}{k}} \right] = 1$.


Whenever $\liminf d/n > 0$ (i.e., $r \ge 1$), the rate (38) has the same
dependence on $n$, $d$, and $k$ as the upper bound (8), thus implying that our
adaptive concentration bound from Theorem 1 is rate optimal.

To establish Theorem 15, we take the $Y_i$ to be i.i.d. and independent of
$X$, with $P[Y_i = M] = P[Y_i = -M] = 1/2$. We will construct $N = N(n)$
nodes $L_1, \ldots, L_N \in \mathcal{L}_{\alpha,k}$ and then consider for $j = 1, \ldots, N$:

(39) $T_j = \frac{1}{\#L_j} \sum_{\{i : X_i \in L_j\}} Y_i$, and $\tilde{T}_j = \frac{1}{\#L_j} \sum_{\{i : X_i \in L_j\}} \tilde{Y}_i$,

where $\tilde{T}_j$ is an approximation to $T_j$ built using auxiliary random variables
$\tilde{Y}_i$ generated as

(40) $\tilde{Y}_i = Y_i |Z_i|$ where $Z_i \sim \mathcal{N}(0, 1)$.

Notice in particular that the $\tilde{Y}_i$ are jointly distributed as independent
Gaussian random variables with variance $M^2$.
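The coupling works because a random sign times a half-normal variable is again normal. The following seeded simulation is a minimal sanity check of this fact for a variable that equals $\pm M$ with equal probability, multiplied by $|Z|$; the helper name `sample_tilde_y`, the value of `M`, and the sample size are arbitrary illustrative choices.

```python
import random
from statistics import fmean

def sample_tilde_y(m, rng):
    """Draw Y uniformly from {-m, +m} and return Y * |Z| with Z ~ N(0, 1)."""
    y = m if rng.random() < 0.5 else -m
    return y * abs(rng.gauss(0.0, 1.0))

rng = random.Random(0)
M = 2.0
draws = [sample_tilde_y(M, rng) for _ in range(100_000)]

mean = fmean(draws)                                   # should be close to 0
second_moment = fmean(x * x for x in draws)           # should be close to M**2
frac_within_one_sd = fmean(1.0 if abs(x) <= M else 0.0 for x in draws)  # about 0.683 for a normal
print(mean, second_moment, frac_within_one_sd)
```

A simulation of this kind only checks a few moments and one quantile; the distributional claim itself follows from the symmetry of the normal density.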

The idea of the proof is to construct a large set of candidate leaf nodes
whose pairwise intersections are small enough that the multivariate normal
distribution of the standardized $\tilde{T}_j$ has correlations that are bounded away
from unity; note that the leaves $L_j$ may be generated by different trees, and
are thus allowed to overlap. A normal approximation lemma [25] then allows
us to stochastically lower-bound the distribution of $\max_j \tilde{T}_j$ in terms of the
distribution of a correlated multivariate normal that can be constructed in
a simple way from an i.i.d. normal sequence. Specifically, we establish the
following lower bound on the tail of the $\tilde{T}_j$.
Lemma 16. For any $r > 0$, set $d := d(n) = \lfloor n^r \rfloor$, and let $\alpha \le 0.2$. Then,
there exists a sequence $k = k(n)$ satisfying Assumption 2 and a set of $N$
$\{\alpha, k\}$-valid candidate leaf nodes $L_j$ chosen independently of the $Y_i$ and $\tilde{Y}_i$,
for which

(41) $\lim_{n \to \infty} P\left[ \max_j \tilde{T}_j \ge 1.999\, M \sqrt{\frac{2}{5}} \sqrt{\frac{\log(N)}{k}} \right] = 1$, and

(42) $\log(N) = \frac{\log(n)\log(d)}{\log(5)} (1 + o(1))$.


In the second step, we establish a coupling between the $\tilde{T}_j$ and the $T_j$ that is
tight enough to guarantee that the approximation error $\max_j\{\tilde{T}_j - T_j\}$ is
smaller than $\max_j \tilde{T}_j$. To get this coupling, we use the following bound on
the moment-generating function of $Y_i - \tilde{Y}_i$.


Lemma 17. Let $P[Y = 1] = P[Y = -1] = 1/2$ and $Z \sim \mathcal{N}(0, 1)$
independent of $Y$. Then

(43) $\mathbb{E}\left[\exp\{t(Y - Y|Z|)\}\right] \le \exp\left\{\left(1 - \sqrt{2/\pi}\right) t^2\right\}$

for $t$ in a neighborhood of zero.
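A bound of this form can be checked numerically. Conditioning on the sign $Y$ and using the moment-generating function of the half-normal $|Z|$ gives the closed form $e^{t^2/2}\left(e^{t}\Phi(-t) + e^{-t}\Phi(t)\right)$ for the left-hand side, where $\Phi$ is the standard normal cdf. The sketch below compares this with $\exp\{(1 - \sqrt{2/\pi})\, t^2\}$ at a few values of $t$; the function names and the grid of $t$ values are illustrative choices.

```python
from math import erf, exp, pi, sqrt

def std_normal_cdf(x):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def mgf(t):
    """E[exp(t (Y - Y|Z|))] for a random sign Y and an independent Z ~ N(0, 1), in closed form."""
    return exp(0.5 * t * t) * (exp(t) * std_normal_cdf(-t) + exp(-t) * std_normal_cdf(t))

def quadratic_exponent_bound(t):
    """exp((1 - sqrt(2 / pi)) t^2), the candidate upper bound."""
    return exp((1.0 - sqrt(2.0 / pi)) * t * t)

for t in (0.1, 0.25, 0.5, 1.0):
    print(t, mgf(t), quadratic_exponent_bound(t))
```

The two sides agree to second order in $t$, since $Y(1 - |Z|)$ has mean zero and variance $2(1 - \sqrt{2/\pi})$, so the gap is visible only at fourth order.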

This lemma implies the bound on $\max_j\{\tilde{T}_j - T_j\}$ given below. The lower
bound in Theorem 15 then follows from Lemma 16 and Corollary 18,
together with the observation that $1.999 \sqrt{2/5} - 1 > \sqrt{\log 5}/5$.
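The numerical observation closing this argument, read here as $1.999\sqrt{2/5} - 1 > \sqrt{\log 5}/5$, can be confirmed directly, along with the threshold $4(1 - \sqrt{2/\pi}) > 0.8$ that appears in the proof of Corollary 18. The variable names below are hypothetical.

```python
from math import log, pi, sqrt

lhs = 1.999 * sqrt(2.0 / 5.0) - 1.0                 # left-hand side of the observation
rhs = sqrt(log(5.0)) / 5.0                          # right-hand side of the observation
eta_sq_threshold = 4.0 * (1.0 - sqrt(2.0 / pi))     # threshold on eta^2 in the proof of Corollary 18
print(lhs, rhs, eta_sq_threshold)
```

The margin in the first inequality is small (about 0.01), which is why the constant 1.999 cannot be replaced by a much smaller one without changing the final rate constant.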

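Corollary 18. Suppose that the statistics $T_j$ and $\tilde{T}_j$ are constructed
as in (39)-(40), with leaves $L_j$ chosen independently of the $Y_i$ and $\tilde{Y}_i$. Then,

(44) $\lim_{n \to \infty} P\left[ \max_j \{\tilde{T}_j - T_j\} \ge M \sqrt{\frac{\log(N)}{k}} \right] = 0$.
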
Remark. In this appendix, we present our proofs in the order of logical
dependence instead of the order in which they appear in the main text. For
example, Theorem 7 depends on Lemma 9, so we prove the lemma first.


APPENDIX A: THE APPROXIMATING SET OF RECTANGLES 

A.1. Proof of Lemma 9. We focus on showing how to construct $R_+$;
the construction of $R_-$ is analogous. Recall that, given a rectangle

$R = \bigotimes_{j=1}^{s} [r_j^-, r_j^+]$,

our goal is to select a rectangle

$R_+ = \bigotimes_{j=1}^{s} [q_j^-, q_j^+]$

from $\mathcal{R}$ such that $R \subseteq R_+$ and $\lambda(R_+) \le e^{\varepsilon} \lambda(R)$. In order to guarantee this,
it is sufficient to check that, for all $j$,

^ ^ 4 ’ ^ - ^7) • 

Now, for each $j$, define

$\tau_j = \left\lfloor \log_2\left( \frac{r_j^+ - r_j^-}{w} \right) \right\rfloor$,

let $q_j^-$ be the largest choice of the form (28) such that $q_j^- \le r_j^-$, and pick $q_j^+$
analogously. These choices define a rectangle $R_+$ in $\mathcal{R}$ such that $R \subseteq R_+$.
By construction, we immediately see that 


2Ej=iD > 2 




log2 


r . — r . 

j— 1 —1 




Moreover, by definition of tj 

2'^iw < — rj < 


thus, we can verify that 


W£ 


j- j- T 1 fJJC 1 £ 


r: — r • 
j 3 


This implies that 
and so \R+\ < e^|ii|. 


A.2. Proof of Theorem 7. Given Lemma 9, in order to complete the 
proof of Theorem 7 it suffices to bound the cardinality of the approximating 
set defined in Section 4.2.1. To do so, we first observe that for fixed values 
of {tj}, the number of possible choices for the {aj} and {bj} is bounded by 


n(' + 

i=i 


2I-D — 

W£. 


1 + 


'2s' 

£ 




< (Id'' 2* 

\W£^ 


n 


r ■ — r • 
3 3 


w 


(1 + 0 (e)) 




W \ £ 


because 0^=1 yf ~ loosely bound the number of 

possible choices for {tj} by (l + log 2 w~^)^, yielding the desired bound. 
A.3. Proof of Corollary 8. Given the parameter choices (25), we can
verify that

$\log|\mathcal{R}_{s,w,\varepsilon}| \le \log\left[ \binom{d}{s} \frac{1}{w} \left( \frac{8 s^2}{\varepsilon^2} \left(1 + \log_2 w^{-1}\right) \right)^s (1 + O(\varepsilon)) \right]$
$\quad = \log\binom{d}{s} + 2s \log(\varepsilon^{-1}) + 2s \log(s) + s \log\log(w^{-1}) + O(\log(n))$.


Meanwhile, 


log (“) < s log (d) = + O (log (<i)). 

Vj log ((1 - a)-i) 

s log (e"^) = ^ ^og log W Q (iQg ^ 

2 log ((1-a)-') 

s log (.), . log log (rc-) = log(nA)l ^lo g H ^ ^ , 

log((l-a)-'l 


Combining these results, we recover (26). 


APPENDIX B: CONCENTRATION OVER RECTANGLES 

B.1. Proof of Lemma 12. The proof of this result relies on a union
bound. Our goal is to show that, for any rectangle $R$ with $R \in \mathcal{R}$ and
$\mu(R) \ge \mu_{\min}$, we can bound the large deviations of $\#R$ as follows: There is
some $n_0 \in \mathbb{N}$ such that, for all $n \ge n_0$,

(45) $P\left[ \left| \frac{\#R}{n} - \mu(R) \right| \ge \Delta \right] \le \frac{\delta}{|\mathcal{R}|}$, where

(46) $\Delta = \sqrt{\frac{3 \mu(R)}{n} \log\left(\frac{|\mathcal{R}|}{\delta}\right)}$.

Verifying (45) then immediately implies the desired bound on $P[\mathcal{A}]$. We
proceed in two parts. First, we verify that (45) holds for very large rectangles
with $\mu(R) \ge 1/2$; second, we consider the smaller rectangles with
$1/2 > \mu(R) \ge \mu_{\min}$.

In the case of large rectangles, we know that $\#R/n$ is sub-Gaussian with
parameter $\sigma^2 = 1/(4n)$. Thus,

$P\left[ \left| \frac{\#R}{n} - \mu(R) \right| \ge \Delta \right] \le 2 \exp\left[-2n\Delta^2\right] \le 2 \exp\left[-3 \log\left(\frac{|\mathcal{R}|}{\delta}\right)\right] = 2 \left(\frac{\delta}{|\mathcal{R}|}\right)^3$,

and (45) is easily satisfied.

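In order to analyze small rectangles, we need the tighter binomial
concentration result of Chernoff [12] and Hoeffding [22], stated below for
convenience.


Proposition (Chernoff-Hoeffding). Let $B$ be a binomial $(n, \mu)$ random
variable. Then

(47) $P\left[ \frac{B}{n} \ge \mu + \Delta \right] \le \left[ \left(\frac{\mu}{\mu + \Delta}\right)^{\mu + \Delta} \left(\frac{1 - \mu}{1 - \mu - \Delta}\right)^{1 - \mu - \Delta} \right]^n$,

(48) $P\left[ \frac{B}{n} \le \mu - \Delta \right] \le \left[ \left(\frac{\mu}{\mu - \Delta}\right)^{\mu - \Delta} \left(\frac{1 - \mu}{1 - \mu + \Delta}\right)^{1 - \mu + \Delta} \right]^n$.
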
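To see how tight a Chernoff-Hoeffding bound is for a small success probability, the sketch below compares the exact binomial upper tail with the relative-entropy form of the bound, $\left[(\mu/(\mu+\Delta))^{\mu+\Delta}\,((1-\mu)/(1-\mu-\Delta))^{1-\mu-\Delta}\right]^n$. The function names and the values of $n$, $\mu$, and $\Delta$ are illustrative assumptions.

```python
from math import ceil, comb

def binomial_upper_tail(n, mu, delta):
    """Exact P[B / n >= mu + delta] for B ~ Binomial(n, mu)."""
    # Small tolerance guards against floating-point error in n * (mu + delta).
    k_min = ceil(n * (mu + delta) - 1e-9)
    return sum(comb(n, k) * mu**k * (1.0 - mu) ** (n - k) for k in range(k_min, n + 1))

def chernoff_hoeffding_upper(n, mu, delta):
    """Relative-entropy form of the Chernoff-Hoeffding upper-tail bound."""
    a = mu + delta
    return ((mu / a) ** a * ((1.0 - mu) / (1.0 - a)) ** (1.0 - a)) ** n

n, mu = 200, 0.1
for delta in (0.02, 0.05, 0.1):
    print(delta, binomial_upper_tail(n, mu, delta), chernoff_hoeffding_upper(n, mu, delta))
```

For small $\mu$ this bound behaves like $\exp\{-n\Delta^2 / (2\mu(1-\mu))\}$ to leading order, which is much sharper than the variance-free bound $\exp\{-2n\Delta^2\}$ used for large rectangles.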


Now by (35), we know that $\Delta/\mu(R) \to 0$ uniformly over our set of
rectangles of interest. Moreover, because we are working with small rectangles
$R$, we also have $\Delta/(1 - \mu(R)) \le \Delta/\mu(R) \to 0$. Finally, we can verify by
calculus that

for all |x| < 0.5. 

1 -\- X 

Thus, for large enough n, the Chernoff-Hoeffding bound implies that for all 
our rectangles of interest. 


-A /X + A 

n 


A 


+ 


< exp 


+ 


n 

A2 


1-M 2(l-/x)2 (1-^)3 

A2 


A A2 a3\ 

A3 

+ t:; -r? 1 (1 “ /X — A) 


< exp 


—n 


2/x(l -/x) 


(1 + 0 ( 1 )) 


We can also apply the same argument to (48), and get for large enough n: 


-1 

I+: 

1 

■& 

> +A 

< 2 exp 

n 




A2 


—n 


< exp 


—n 


2/x(l - fi) 
A^ 

3/x(l - ^) 


(1 + 0 ( 1 )) 


< 


|77|’ 


thus concluding the proof. 








































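B.2. Proof of Lemma 13. Because $\mu(R) \ge \mu_{\min}$, we can use
Assumption 1 to verify that $\lambda(R) \ge \zeta^{-1}\mu(R) \ge w$. Thus, by Theorem 7, we know
that there exists a rectangle $R_+ \in \mathcal{R}_{s,w,\varepsilon}$ such that

$R \subseteq R_+$, and $e^{-\varepsilon} \lambda(R_+) \le \lambda(R)$.

Moreover, again by Assumption 1 and because $R \subseteq R_+$, we can verify that

$\mu(R_+) \le \mu(R) + \zeta\left(\lambda(R_+) - \lambda(R)\right)$
$\quad \le \mu(R) + \zeta\left(e^{\varepsilon} - 1\right) \lambda(R)$
$\quad \le \mu(R) + \zeta^2 \left(e^{\varepsilon} - 1\right) \mu(R)$
$\quad \le e^{\zeta^2 \varepsilon} \mu(R)$.

Then, we see that on $\mathcal{A}$,

$\#R \le \#R_+ \le n\mu(R_+) + \sqrt{3 n \mu(R_+) \log\left(\frac{|\mathcal{R}_{s,w,\varepsilon}|}{\delta}\right)}$
$\quad \le e^{\zeta^2 \varepsilon}\, n\mu(R) + e^{\frac{1}{2}\zeta^2 \varepsilon} \sqrt{3 n \mu(R) \log\left(\frac{|\mathcal{R}_{s,w,\varepsilon}|}{\delta}\right)}$,

where the second inequality follows from Lemma 12.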


V 

1*^5, IP, £ I 


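B.3. Proof of Corollary 14. By Lemma 13, we see that on $\mathcal{A}$

$\sup\{\#R : \mu(R) = \zeta w\} \le e^{\zeta^2 \varepsilon} \zeta n w + e^{\frac{1}{2}\zeta^2 \varepsilon} \sqrt{3 \zeta n w \log\left(\frac{|\mathcal{R}_{s,w,\varepsilon}|}{\delta}\right)} \le \frac{3k}{4}$

for large enough $n$. In other words, for large enough $n$, all rectangles of size
$\zeta w$ can have at most $3k/4$ points in them. Thus, we conclude that, on $\mathcal{A}$,
any rectangle with $k$ points must have size greater than $\zeta w$.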


B.4. Proof of Theorem 10. For this whole proof, we assume that the
event $\mathcal{A}$ defined in Lemma 12 has occurred, with $\mathcal{R} = \mathcal{R}_{s,w,\varepsilon}$, $\mu_{\min} = \zeta w/2$,
and $\delta = 1/\sqrt{n}$. Note that, thanks to Assumption 2, these choices satisfy the
conditions (35), and so Lemma 12 implies that there is an $n_0 \in \mathbb{N}$ such that
the event $\mathcal{A}$ must occur with probability at least $1 - 1/\sqrt{n}$ for all $n \ge n_0$.

Given the event $\mathcal{A}$, we first note that Corollary 14 implies that $\mu(L) \ge \zeta w$
for all $L \in \mathcal{L}_{\alpha,k}$; by Assumption 1, this also implies that $\lambda(L) \ge w$ for all
$L \in \mathcal{L}_{\alpha,k}$. Thus, by Theorem 7, for each possible leaf $L \in \mathcal{L}_{\alpha,k}$ we can select
a rectangle $R \in \mathcal{R}_{s,w,\varepsilon}$ such that $R \subseteq L$ and

$\lambda(L) \le e^{\varepsilon} \lambda(R)$.

This establishes the first part of our desired result.

Next, we need to control the counts and ^R on A. First, by Lemma 
13, we immediately see that, on A, 

nfi (L) + 62 ‘^^^y^31og \/nfi (L) - #L > 0, 

or equivalently that 

n,. (L) > 5^ - y '3 log + y'siog + 4#L^ , 

and so, because $\varepsilon \to 0$ and we are in a regime where $\log(|\mathcal{R}_{s,w,\varepsilon}|/\delta) \ll \#L$,
we find that

UM (i) > #i - + O (log ^ . 

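Meanwhile, by Assumption 1,

$\mu(R) \ge \mu(L) - \zeta\left(\lambda(L) - \lambda(R)\right) \ge \mu(L) - \zeta\left(1 - e^{-\varepsilon}\right) \lambda(L)$
$\quad \ge \left(1 - \zeta^2 \left(1 - e^{-\varepsilon}\right)\right) \mu(L) \ge \mu_{\min}$,

where the last inequality is valid provided $\varepsilon$ is small enough (i.e., $n$ is large
enough). Thus, because $R \in \mathcal{R}_{s,w,\varepsilon}$, we can use Lemma 12 to verify that,
on $\mathcal{A}$,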

#R > nfj, (R) — 
where the third inequality relies on $\varepsilon$ being small enough, and the fourth
inequality relies on a second application of Lemma 12. Chaining these
inequalities together, we find that, on $\mathcal{A}$ and provided that $n$ is large enough,


(l + e 


-3C^£ 


'3#Llog 


\'^S, W, £ 


-I- O (log 




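Finally, by Assumption 1, $\varepsilon = 1/\sqrt{k} \ll 1/\sqrt{\log(|\mathcal{R}_{s,w,\varepsilon}|/\delta)}$, and moreover we
know that $\delta^{-1} \ll |\mathcal{R}_{s,w,\varepsilon}|$; thus, the expression simplifies to

$\#L - \#R \le 3\zeta^2 \varepsilon\, \#L + 2\sqrt{3 \log(|\mathcal{R}_{s,w,\varepsilon}|)\, \#L} + O(\log(|\mathcal{R}_{s,w,\varepsilon}|))$.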


B.5. Proof of Corollary 11. Let $\mathcal{A}$ be the "good" set used in the
proof of Theorem 10; recall that $P[\mathcal{A}] \ge 1 - 1/\sqrt{n}$ for large enough $n$.
Now, for any leaf $L$ generated by a valid tree, let $R \in \mathcal{R}_{s,w,\varepsilon}$ be the inner
approximation for $L$ constructed in Theorem 10. By the triangle inequality,


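$\sup\left\{ \left| \frac{1}{\#L} \sum_{\{i : X_i \in L\}} Y_i - \mathbb{E}[Y \mid X \in L] \right| : L \in \mathcal{L} \right\}$
$\quad \le \sup\left\{ \left| \frac{1}{\#L} \sum_{\{i : X_i \in L\}} Y_i - \frac{1}{\#R} \sum_{\{i : X_i \in R\}} Y_i \right| : L \in \mathcal{L} \right\}$
$\quad + \sup\left\{ \left| \frac{1}{\#R} \sum_{\{i : X_i \in R\}} Y_i - \mathbb{E}[Y \mid X \in R] \right| : R \in \mathcal{R}_{s,w,\varepsilon},\ \#R \ge k/2 \right\}$
$\quad + \sup\left\{ \left| \mathbb{E}[Y \mid X \in R] - \mathbb{E}[Y \mid X \in L] \right| : L \in \mathcal{L} \right\}$.
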


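We can now proceed to bound each term individually. Starting with the first
term, we note that because $R \subseteq L$,

$\left| \frac{1}{\#L} \sum_{\{i : X_i \in L\}} Y_i - \frac{1}{\#R} \sum_{\{i : X_i \in R\}} Y_i \right| \le 2 \sup\{|Y_i| : X_i \in L\}\, \frac{\#L - \#R}{\#L}$,

and by Theorem 10, on event $\mathcal{A}$,

$\sup\left\{ \frac{\#L - \#R}{\#L} : L \in \mathcal{L} \right\} \le 3\zeta^2 \varepsilon + 2\sqrt{\frac{3 \log|\mathcal{R}_{s,w,\varepsilon}|}{k}} + O\left(\frac{\log|\mathcal{R}_{s,w,\varepsilon}|}{k}\right)$.
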
Meanwhile, in order to bound the last term, we note that our uniform
sub-Gaussianity condition implies that, for any points $x_1$ and $x_2$,
$|\mathbb{E}[Y \mid X = x_2] - \mathbb{E}[Y \mid X = x_1]| \le 2M$. To see this, apply (32) with
$F(X) = \frac{1}{2}\delta(\{X = x_1\}) + \frac{1}{2}\delta(\{X = x_2\})$. Thus, we find that

$\left| \mathbb{E}[Y \mid X \in R] - \mathbb{E}[Y \mid X \in L] \right| \le 2M\, \frac{\mu(L) - \mu(R)}{\mu(L)} \le 2M\zeta^2 \left(1 - e^{-\varepsilon}\right)$,

where the last inequality used Assumption 1 and the fact that $\lambda(R) \ge
e^{-\varepsilon}\lambda(L)$.

Finally, thanks to our uniform sub-Gaussianity assumption (32), we see
that, conditionally on $\#R$, the mean of the $Y_i$ over $R$ is sub-Gaussian
with parameter $\sigma^2 = M^2/\#R$. Moreover, by the proof of Theorem 10, we
see that there is an $n_0 \in \mathbb{N}$ such that $\#R \ge k/2$ on $\mathcal{A}$ whenever $n \ge n_0$.
Thus, conditionally on $\mathcal{A}$ and provided that $n \ge n_0$, the following event $\mathcal{B}$
holds with probability at least $1 - 1/\sqrt{n}$:

$\mathcal{B} : \sup\left\{ \left| \frac{1}{\#R} \sum_{\{i : X_i \in R\}} Y_i - \mathbb{E}[Y \mid X \in R] \right| : R \in \mathcal{R}_{s,w,\varepsilon},\ \#R \ge k/2 \right\} \le M \sqrt{\frac{2 \log\left(2|\mathcal{R}_{s,w,\varepsilon}|\sqrt{n}\right)}{k/2}}$.

To verify this fact, we apply a union bound over the set of rectangles $R \in
\mathcal{R}_{s,w,\varepsilon}$ with $\#R \ge k/2$. Combining all these bounds yields (33).


APPENDIX C: RANDOM FOREST CONSISTENCY 

C.1. Proof of Lemma 5. By expanding the square, $4 N_- N_+ / (N_- +
N_+)^2 \le 1$, and so $\ell(\theta) \le \Delta^2(\theta)$. Now, by Assumption 3, for all $j \notin Q$,

$\Delta^*(\theta) := \mathbb{E}[Y_i \mid X_i \in \nu,\ (X_i)_j > \theta] - \mathbb{E}[Y_i \mid X_i \in \nu,\ (X_i)_j \le \theta] = 0$.

Moreover, we constructed the tree such that the sets $\{i : X_i \in \nu,\ (X_i)_j \le \theta\}$
and $\{i : X_i \in \nu,\ (X_i)_j > \theta\}$ both have at least $k$ observations. Thus, by
Theorem 1 and Corollary 11,

$|\Delta(\theta) - \Delta^*(\theta)| = |\Delta(\theta)| \le 2 \times 9M \sqrt{\frac{\log(n)\log(d)}{k \log((1 - \alpha)^{-1})}}$

with probability at least $1 - O(1/\sqrt{n})$, uniformly over all possible nodes $\nu$
with at least $2k$ observations and all variables $j \notin Q$. We conclude that, with
probability at least $1 - O(1/\sqrt{n})$, (12) is never satisfied for any $j \notin Q$,
uniformly over all nodes of all trees that can be generated as guess-and-check
trees.



C.2. Proof of Lemma 6. Let $\nu$ be the current node considered by
the algorithm, let $j \in Q$, and suppose that the node $\nu$ has never yet been
cut along the direction $j$. To establish the desired result, it suffices to show
that the split will succeed with probability tending to 1, uniformly over all
$j \in Q$ and all possible nodes $\nu$ with at least $2k$ observations that have not
yet been divided along $j$. Our goal is to show that, with high probability,
the split at $\theta = 1/2$ satisfies (12). Since the actual splitting point $\hat{\theta}$ is chosen
by maximizing $\ell(\theta)$, the result then also holds for $\hat{\theta}$.

We first note that, by Assumption 4,

$\Delta^*\left(\tfrac{1}{2}\right) = \mathbb{E}\left[Y_i \mid X_i \in \nu,\ (X_i)_j > \tfrac{1}{2}\right] - \mathbb{E}\left[Y_i \mid X_i \in \nu,\ (X_i)_j \le \tfrac{1}{2}\right] \ge \beta$.

Thus, by Theorem 1 and Corollary 11, and by Assumption 2 on the minimum
leaf-size $k$, we see that with probability at least $1 - O(1/\sqrt{n})$

$\Delta^2\left(\tfrac{1}{2}\right) \ge \beta^2 + o(1)$

uniformly over all possible nodes. Next, by Lemma 13 and a similar
argument, we again find that with probability at least $1 - O(1/\sqrt{n})$

$\frac{4 N_-(1/2)\, N_+(1/2)}{\left(N_-(1/2) + N_+(1/2)\right)^2} = 1 + o(1)$,

uniformly over all possible nodes. Thus, (12) is in fact satisfied at $\theta =
1/2$ with high probability, and the split will succeed with high probability
(although the split may not actually occur at $\theta = 1/2$, as there may be even
more significant potential splitting points).


C.3. Proof of Theorem 3. We start by defining the following event
$\mathcal{E}$: "all trees in the forest never split on any variable $j \notin Q$, and always split
on any variable $j \in Q$ when $j$ is drawn in phase 2 of the guess-and-check
procedure." On this event, our $d$-dimensional guess-and-check tree is
equivalent to a $q$-dimensional guess-and-check tree supported on the coordinates
$j \in Q$.

From Corollary 2, we already know that

sup 

xe[o,i]'^ y k J 


where $H(x)$ is our guess-and-check forest and $H^*(x)$ is the corresponding
partition-optimal forest in the sense of Definition 3. Moreover, from Lemma 5 and
Lemma 6, we know that $P[\mathcal{E}] \to 1$. Thus, to obtain uniform consistency, it
only remains to show that, conditionally on $\mathcal{E}$,

$\sup_{x \in [0,1]^d} |H^*(x) - \mathbb{E}[Y \mid X = x]| = o_P(1)$.

Now, let $T^*$ be a single partition-optimal tree comprising $H^*$. Because
$\mathbb{E}[Y \mid X = x]$ is Lipschitz in $x$, and because $T^*(x) = \mathbb{E}[Y \mid X \in L(x)]$, we
see that

$|T^*(x) - \mathbb{E}[Y \mid X = x]| \le C_{\text{Lip}}\, \text{diam}(L(x))$,

where $C_{\text{Lip}}$ is the Lipschitz constant, and the diameter $\text{diam}(L(x))$ of the
leaf $L(x)$ is defined as the length of the longest line segment contained inside
$L(x)$. This implies that

$|H^*(x) - \mathbb{E}[Y \mid X = x]| \le C_{\text{Lip}}\, \mathbb{E}_*[\text{diam}(L(x))]$,

where $\mathbb{E}_*$ is an expectation over all trees comprising the forest. Finally, by
Lemma 2 of Meinshausen [31], we see that on event $\mathcal{E}$

$\sup_{x \in [0,1]^d} \mathbb{E}_*[\text{diam}(L(x))] = o_P(1)$,

thus concluding the proof. We note that, in his paper, Meinshausen [31]
only discusses convergence at a single $x$; however, his proof is based on a
Kolmogorov-Smirnov argument that in fact holds uniformly for all $x$.

C.4. Proof of Theorem 4. We again define the same event $\mathcal{E}$ as in
the proof of Theorem 3. This time, since the tree effectively always splits at
the middle of a randomly selected feature $j \in Q$, our tree on event $\mathcal{E}$ is in
fact equivalent to a non-adaptive median tree trained on the feature set $Q$,
where each splitting variable is selected independently at random. This is
exactly the class of forests studied by Duroux and Scornet [17], who showed
that (13) in fact holds for them.

Now, thanks to Lemma 5 and Lemma 6, the probability of the event
$\mathcal{E}$ failing decays to 0 at a rate $1/\sqrt{n}$. Thus, because our responses $Y$ are
bounded, the risk accrued from failures of $\mathcal{E}$ is bounded on the order of
$O(1/\sqrt{n})$, which is vanishingly small relative to the rate in (13). Thus, we
conclude that (13) in fact holds for guess-and-check trees with $\alpha = 1/2$.

APPENDIX D: LOWER BOUNDS 

D.1. Proof of Lemma 16. For any $0 < a \le 0.5$, we study $a$-random
partitions generated as follows. Given $d_n = \lfloor n^r \rfloor$ as assumed in Lemma 16,
we set


S — Sfi 


max 




log^(n) 

n 


(logo) ^ 


and k = kn ■= [na'^"J . 


Then $k \ge \log^2(n)$ for all $n \ge 100$, so Assumption 2 is met. Now, for each
index set $S \subseteq \{1, \ldots, d\}$ with $|S| = s$, define the partition $\Lambda_S$ as follows: For
each $j = 1, \ldots, d$, in order, if $j \in S$, then recursively split each leaf along the
$j$-th feature in such a way that each child-node contains at least a fraction $a$
of the data points in its parent-node, with one child-node containing exactly
a fraction $a$ up to rounding. Given this construction, each terminal node $L$
has at least $\#L \ge a^s n \ge k$ points, so $\Lambda_S$ is in fact an $\{a, k\}$-valid partition.

Furthermore, the above construction provides for one terminal node $L$
that satisfies

$\#L \le a^s n + \sum_{i=0}^{s-1} a^i \le (k + 1) + \frac{1}{1 - a} \le k + 3$.

For each $s$-combination $S$ of $\{1, \ldots, d\}$ construct $\Lambda_S$ as described above.
Then for each of the $N := \binom{d}{s}$ $s$-combinations the resulting partition has
one terminal node with the above properties. Denote these $N$ terminal nodes
by $L_1, \ldots, L_N$; so for $j = 1, \ldots, N$, $k \le \#L_j \le k + 3$. Moreover, if $j \ne j'$
then the splits in $L_j$ and $L_{j'}$ occur on axes that differ in at least one index.
Since $X$ has independent marginals we get

(49) $\mathbb{E}\left[\#(L_j \cap L_{j'})\right] \le a\, \mathbb{E}[\#L_j] + 1 \le ak + 3$.

Since the overlap between the $L_j$ is not very large, we might hope that
the maximum of the $\tilde{T}_j$ would be of comparable size to the maximum of
$N$ independent Gaussian random variables with variance $M^2/k$. The
following sub-result shows that this intuition is valid, at least to within a
factor $\sqrt{1 - a}$.


Proposition. Assume that $d_n = \lfloor n^r \rfloor$ for some $r > 0$. Then, given the
nodes $L_j$ and statistics $\tilde{T}_j$ as constructed above, and for any $\eta > 0$,

(50) $\lim_{n \to \infty} P\left[ \max_{j = 1, \ldots, N} \tilde{T}_j \ge (1 - \eta)\, M \sqrt{2(1 - a)} \sqrt{\frac{\log(n)\log(d)}{\log(a^{-1})}}\, \frac{1}{\sqrt{k}} \right] = 1$.


The claim from Lemma 16 follows directly from (50) by noting that,
following our construction, an $a$-random partition is always an $\{\alpha, k\}$-valid
partition for any $\alpha \le a$. Thus, if we know that $\alpha \le 0.2$, we can apply (50)
for 0.2-random partitions, thus yielding the desired bound.

We now proceed to proving the sub-result. Independence of X and Y 
implies that the Tj are standard normal, and so, for j ^ m: 


Cov 


Tj, -v /#Ljn T„ 


< ME 


, n Lm) 

= me ‘ ^ - 

#{Lj n Lm) 


k 


< M { a + - 


by (49). Now, let $Z_1, \ldots, Z_{N+1}$ be i.i.d. $\mathcal{N}(0, M^2)$ and for $j = 1, \ldots, N$ set

$\tilde{Z}_j := \sqrt{1 - \alpha_n}\, Z_j + \sqrt{\alpha_n}\, Z_{N+1}$, with $\alpha_n := a + \frac{3}{k_n}$.

The $\tilde{Z}_j$ are marginally normal with variance $M^2$, and $\text{Cov}[\tilde{Z}_j, \tilde{Z}_m] = \alpha_n M^2$
if $j \ne m$. Using Corollary 4.2.3 to the normal approximation lemma of
Leadbetter et al. [25], and the fact that $\sqrt{\#L_j / k} \le 1 + 2/k$, we get for
every $u > 0$:


max VkTj < u 




(51) 


< 


< 


max Ji^Lj Tj < [l + - \ u 


max Z, < 1 + - u 

j=i,...,N ■' \ k 


max Z,- < 


(1 T fc)^ yjOinZN+1 

\/l Ofn 


Setting $u = u_n := (1 - \eta)\, M \sqrt{2(1 - a)} \sqrt{\log(n)\log(d)/\log(a^{-1})}$, our goal is
to show that the above probability converges to 0. First, observe that


2un — r]/2 

, \(^nZM+l E .. Un 
Kn, i 


1 


^-r] 

because, by assumption, dn = [n^J whereas kn log(n)^. Thus, by (51), it 
suffices to verify that 

l-ry/2 Ur, 


(52) E 

Then, noting that 


max Z, < ,_ 

^ - 1-7? 


0 . 


$\log(N) = \log\binom{d}{s} = s \log(d)\, (1 + o(1)) = \frac{\log(n)\log(d)}{\log(a^{-1})} (1 + o(1))$,

we can use a standard Gaussian tail bound to check that (52) in fact holds.
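D.2. Proof of Lemma 17. One readily checks that

$\mathbb{E}\left[\exp\{t(Y - Y|Z|)\}\right] = \exp\left(\tfrac{1}{2}t^2 + t\right) \Phi(-t) + \exp\left(\tfrac{1}{2}t^2 - t\right) \Phi(t)$,

where $\Phi$ is the cdf of $Z$. Using the expansion

$\Phi(t) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \sum_{k=0}^{\infty} \frac{(-1)^k t^{2k+1}}{k!\, 2^k (2k + 1)}$,

we find that

$\mathbb{E}\left[\exp\{t(Y - Y|Z|)\}\right] = \left( \sum_{k=0}^{\infty} \frac{t^{2k}}{k!\, 2^k} \right) \left[ \sum_{k=0}^{\infty} \frac{t^{2k}}{(2k)!} - \sqrt{\frac{2}{\pi}} \left( \sum_{k=0}^{\infty} \frac{t^{2k+1}}{(2k + 1)!} \right) \left( \sum_{k=0}^{\infty} \frac{(-1)^k t^{2k+1}}{k!\, 2^k (2k + 1)} \right) \right]$
$\quad = 1 + \left(1 - \sqrt{\frac{2}{\pi}}\right) t^2 + \frac{10 - 12\sqrt{2/\pi}}{4!}\, t^4 + O(t^6)$.

Lemma 17 follows since

$\frac{10 - 12\sqrt{2/\pi}}{4!} < \frac{1}{2} \left(1 - \sqrt{\frac{2}{\pi}}\right)^2$.
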

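D.3. Proof of Corollary 18. For simplicity, we take $M = 1$, and so
$Y_i \in \{\pm 1\}$ and $\text{Var}[Y_i] = 1$. A standard argument using Markov's inequality
gives for any $t, v > 0$:

$P\left[ \sqrt{\#L_j}\, (\tilde{T}_j - T_j) \ge v \mid X \right] \le \left( \mathbb{E}\left[\exp\{t(Y_i|Z_i| - Y_i)\}\right] \right)^{\#L_j} \exp\left\{-t v \sqrt{\#L_j}\right\}$
$\quad \le \exp\left\{ -\frac{v^2}{4\left(1 - \sqrt{2/\pi}\right)} \right\}$

by Lemma 17, provided that $t := v / \left(2\sqrt{\#L_j}\, (1 - \sqrt{2/\pi})\right)$ is small enough.
Now, set $v = v_n := \eta\sqrt{\log N}$ for some fixed $\eta > 0$. First, recalling $\min_j \#L_j \ge
k_n \ge \log^2 n$, we verify that in fact $t \to 0$. Thus,

$P\left[ \max_j \sqrt{k_n}\, (\tilde{T}_j - T_j) \ge \eta\sqrt{\log N} \right] \le N \max_j \mathbb{E}\left[ P\left[ \sqrt{\#L_j}\, (\tilde{T}_j - T_j) \ge v_n \mid X \right] \right]$
$\quad \le \exp\left\{ (\log N) \left(1 - \frac{\eta^2}{4\left(1 - \sqrt{2/\pi}\right)}\right) \right\}$,

which converges to zero provided that $\eta^2 > 4\left(1 - \sqrt{2/\pi}\right) > 0.8$.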

